ResearchTrend.AI
TextSquare: Scaling up Text-Centric Visual Instruction Tuning

19 April 2024
Jingqun Tang
Chunhui Lin
Zhen Zhao
Shu Wei
Binghong Wu
Qi Liu
Hao Feng
Yang Li
Siqi Wang
Lei Liao
Wei Shi
Yuliang Liu
Hao Liu
Yuan Xie
Xiang Bai
Can Huang
    LRM
    VLM
    MLLM
Abstract

Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading closed-source models such as GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, generated using closed-source MLLMs. The data-construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M yield three key findings: 1) Our model, TextSquare, considerably surpasses the previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models such as GPT4V and Gemini on 6 of 10 text-centric benchmarks. 2) We demonstrate the critical role of VQA reasoning data in providing comprehensive contextual insight for specific questions; it not only improves accuracy but also significantly mitigates hallucination. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination-evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, scaling text-centric VQA datasets reveals a clear pattern: model performance improves in proportion to the exponential growth of instruction-tuning data volume, validating the necessity of both the scale and the high quality of Square-10M.
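The four-step Square process described above can be sketched as a simple generation loop. This is a hypothetical illustration only: `ask` stands in for a call to a closed-source MLLM, and all names (`VQAExample`, `square_pipeline`) are illustrative assumptions, not APIs from the paper.

```python
# Hypothetical sketch of the Square data-construction steps:
# Self-Questioning, Answering, Reasoning, and Evaluation.
from dataclasses import dataclass


@dataclass
class VQAExample:
    image_id: str
    question: str = ""
    answer: str = ""
    reasoning: str = ""
    keep: bool = False


def square_pipeline(image_id: str, ask) -> VQAExample:
    """Run the four Square steps for one image; `ask` is an MLLM stand-in."""
    ex = VQAExample(image_id=image_id)
    # 1) Self-Questioning: have the MLLM propose a text-centric question.
    ex.question = ask(f"Propose a question about the text in {image_id}.")
    # 2) Answering: answer the generated question.
    ex.answer = ask(f"Answer this question about {image_id}: {ex.question}")
    # 3) Reasoning: elicit the contextual rationale behind the answer.
    ex.reasoning = ask(f"Explain why '{ex.answer}' answers '{ex.question}'.")
    # 4) Evaluation: self-check, keeping only examples judged correct.
    verdict = ask(f"Is '{ex.answer}' correct for '{ex.question}'? yes/no")
    ex.keep = verdict.strip().lower().startswith("yes")
    return ex


# Usage with a dummy model whose replies all begin with "yes".
dummy = lambda prompt: "yes: " + prompt[:20]
example = square_pipeline("doc_001.png", dummy)
print(example.keep)  # True, since the dummy verdict starts with "yes"
```

In the paper's setting, passing the evaluation step gates whether a generated (question, answer, reasoning) triple enters the dataset; the stub above mirrors that with a boolean `keep` flag.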

@article{tang2025_2404.12803,
  title={TextSquare: Scaling up Text-Centric Visual Instruction Tuning},
  author={Jingqun Tang and Chunhui Lin and Zhen Zhao and Shu Wei and Binghong Wu and Qi Liu and Hao Feng and Yang Li and Siqi Wang and Lei Liao and Wei Shi and Yuliang Liu and Hao Liu and Yuan Xie and Xiang Bai and Can Huang},
  journal={arXiv preprint arXiv:2404.12803},
  year={2025}
}