Despite the rapid evolution of Multimodal Large Language Models (MLLMs), a non-negligible limitation remains: they struggle with visual text grounding, especially in text-rich document images. Document images, such as scanned forms and infographics, pose critical challenges due to their complex layouts and dense textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding in natural images rather than text-rich document images. To bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90K synthetic data based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. By fine-tuning MLLMs on our synthetic dataset, they promisingly improve spatial reasoning and grounding capabilities.
@article{li2025_2504.04974,
title={Towards Visual Text Grounding of Multimodal Large Language Model},
author={Ming Li and Ruiyi Zhang and Jian Chen and Jiuxiang Gu and Yufan Zhou and Franck Dernoncourt and Wanrong Zhu and Tianyi Zhou and Tong Sun},
journal={arXiv preprint arXiv:2504.04974},
year={2025}
}