A Token-level Text Image Foundation Model for Document Understanding

4 March 2025
Tongkun Guan
Zining Wang
Pei Fu
Zhengtao Guo
Wei Shen
Kai Zhou
Tiezhu Yue
Chen Duan
Hao Sun
Qianyi Jiang
Junfeng Luo
Xiaokang Yang
Abstract

In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding, and reasoning with images containing small and dense text. To bridge this gap, we develop TokenOCR, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be available at this https URL.
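The abstract describes TokenIT as a dataset of token-mask pairs but does not specify a schema or training objective. The sketch below is a minimal, hypothetical illustration of what a token-level sample and a per-token feature-pooling step could look like; all field and function names (TokenMaskPair, TokenITSample, token_pooled_features) are assumptions for illustration, not the authors' actual format or method.

# Hypothetical sketch of a TokenIT-style sample and token-level pooling.
# Field names and the pooling objective are illustrative assumptions only.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class TokenMaskPair:
    token: str           # a single text token visible in the image
    mask: np.ndarray     # binary pixel mask (H, W) locating that token


@dataclass
class TokenITSample:
    image: np.ndarray             # document image, (H, W, 3) uint8
    pairs: List[TokenMaskPair]    # one mask per visible token


def token_pooled_features(image_features: np.ndarray,
                          pairs: List[TokenMaskPair]) -> np.ndarray:
    """Average visual features inside each token's mask region.

    One plausible way a token-level foundation model could align per-token
    visual embeddings with text-token embeddings; the paper's actual
    pretraining objective may differ.
    """
    h, w, c = image_features.shape
    pooled = []
    for p in pairs:
        m = p.mask.astype(bool)
        pooled.append(image_features[m].mean(axis=0) if m.any() else np.zeros(c))
    return np.stack(pooled)  # (num_tokens, c)


if __name__ == "__main__":
    # Toy usage: a 4x4 feature map with two token masks.
    feats = np.random.rand(4, 4, 8)
    mask_a = np.zeros((4, 4), dtype=np.uint8); mask_a[:2, :2] = 1
    mask_b = np.zeros((4, 4), dtype=np.uint8); mask_b[2:, 2:] = 1
    sample_pairs = [TokenMaskPair("Invoice", mask_a), TokenMaskPair("#2025", mask_b)]
    print(token_pooled_features(feats, sample_pairs).shape)  # (2, 8)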

@article{guan2025_2503.02304,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Tongkun Guan and Zining Wang and Pei Fu and Zhengtao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
  journal={arXiv preprint arXiv:2503.02304},
  year={2025}
}