ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2503.02304
  4. Cited By
A Token-level Text Image Foundation Model for Document Understanding
v1v2 (latest)

A Token-level Text Image Foundation Model for Document Understanding

4 March 2025
Tongkun Guan
Zining Wang
Pei Fu
Zhengtao Guo
Wei Shen
Kai Zhou
Tiezhu Yue
Chen Duan
Hao Sun
Qianyi Jiang
Junfeng Luo
Yunbo Wang
    VLM
ArXiv (abs)PDFHTMLHuggingFace (4 upvotes)

Papers citing "A Token-level Text Image Foundation Model for Document Understanding"

50 / 95 papers shown
Title
MGA-VQA: Secure and Interpretable Graph-Augmented Visual Question Answering with Memory-Guided Protection Against Unauthorized Knowledge Use
MGA-VQA: Secure and Interpretable Graph-Augmented Visual Question Answering with Memory-Guided Protection Against Unauthorized Knowledge Use
Ahmad Mohammadshirazi
Pinaki Prasad Guha Neogi
Dheeraj Kulshrestha
R. Ramnath
60
0
0
22 Nov 2025
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
Maoyuan Ye
Jing Zhang
Juhua Liu
Bo Du
Dacheng Tao
Bo Du
LRM
426
1
0
18 May 2025
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced
  Multimodal Understanding
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Z. F. Wu
Xiaokang Chen
Zizheng Pan
Xianglong Liu
Wen Liu
...
Xingkai Yu
Haowei Zhang
Bo Pan
Yijiao Wang
Chong Ruan
MLLMVLMMoE
361
374
0
13 Dec 2024
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Weiyun Wang
Zhe Chen
Wenhai Wang
Yue Cao
Yangzhou Liu
...
Jinguo Zhu
X. Zhu
Lewei Lu
Yu Qiao
Jifeng Dai
LRM
458
174
1
15 Nov 2024
Locality Alignment Improves Vision-Language Models
Locality Alignment Improves Vision-Language ModelsInternational Conference on Learning Representations (ICLR), 2024
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
517
11
0
14 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLMMLLM
275
64
1
30 Sep 2024
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page
  Document Understanding
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document UnderstandingAnnual Meeting of the Association for Computational Linguistics (ACL), 2024
Anwen Hu
Haiyang Xu
Liang Zhang
Jiabo Ye
Ming Yan
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
279
76
0
05 Sep 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Andrew Tao
Zhiding Yu
Guilin Liu
Guilin Liu
MLLM
354
109
0
28 Aug 2024
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document UnderstandingComputer Vision and Pattern Recognition (CVPR), 2024
Wenhui Liao
Jiapeng Wang
Hongliang Li
Chengyu Wang
Jun Huang
Lianwen Jin
459
0
0
27 Aug 2024
Building and better understanding vision-language models: insights and
  future directions
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
270
127
0
22 Aug 2024
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLMSyDaVLM
447
1,651
0
06 Aug 2024
Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language
  Models
Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language ModelsInternational Conference on Learning Representations (ICLR), 2024
Mingxin Huang
Yuliang Liu
Dingkang Liang
Lianwen Jin
Xiang Bai
233
2
0
04 Aug 2024
Token-level Correlation-guided Compression for Efficient Multimodal
  Document Understanding
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
Renshan Zhang
Yibo Lyu
Rui Shao
Gongwei Chen
Weili Guan
Liqiang Nie
166
18
0
19 Jul 2024
PosFormer: Recognizing Complex Handwritten Mathematical Expression with
  Position Forest Transformer
PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer
Tongkun Guan
Chengyu Lin
Wei Shen
Xiaokang Yang
231
15
0
10 Jul 2024
TokenPacker: Efficient Visual Projector for Multimodal LLM
TokenPacker: Efficient Visual Projector for Multimodal LLM
Wentong Li
Yuqian Yuan
Jian Liu
Dongqi Tang
Song Wang
Jie Qin
Jianke Zhu
Lei Zhang
MLLM
338
115
0
02 Jul 2024
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
Jinghui Lu
Haiyang Yu
Yanjie Wang
Yongjie Ye
Jingqun Tang
...
Qi Liu
Hao Feng
Han Wang
Hao Liu
Can Huang
523
32
0
02 Jul 2024
DocKylin: A Large Multimodal Model for Visual Document Understanding
  with Efficient Visual Slimming
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Jiaxin Zhang
Wentao Yang
Songxuan Lai
Zecheng Xie
Lianwen Jin
323
27
0
27 Jun 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong
Ellis L Brown
Penghao Wu
Sanghyun Woo
Manoj Middepogu
...
Xichen Pan
Austin Wang
Rob Fergus
Yann LeCun
Saining Xie
3DVMLLM
311
603
0
24 Jun 2024
DocGenome: An Open Large-scale Scientific Document Benchmark for
  Training and Testing Multi-modal Large Language Models
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models
Renqiu Xia
Song Mao
Xiangchao Yan
Hongbin Zhou
Bo Zhang
...
Yongwei Wang
Bin Wang
Junchi Yan
Fei Wu
Yu Qiao
204
21
0
17 Jun 2024
What matters when building vision-language models?
What matters when building vision-language models?Neural Information Processing Systems (NeurIPS), 2024
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
248
270
0
03 May 2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal
  Models with Open-Source Suites
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
...
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLMVLM
438
955
0
25 Apr 2024
HRVDA: High-Resolution Visual Document Assistant
HRVDA: High-Resolution Visual Document AssistantComputer Vision and Pattern Recognition (CVPR), 2024
Chaohu Liu
Kun Yin
Haoyu Cao
Xinghua Jiang
Xin Li
Yinsong Liu
Deqiang Jiang
Xing Sun
Linli Xu
VLM
227
30
0
10 Apr 2024
LayoutLLM: Layout Instruction Tuning with Large Language Models for
  Document Understanding
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Chuwei Luo
Yufan Shen
Zhaoqing Zhu
Qi Zheng
Zhi Yu
Cong Yao
297
88
0
08 Apr 2024
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document
  Understanding
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
Haiyang Xu
Jiabo Ye
Mingshi Yan
Liang Zhang
...
Chen Li
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
241
186
0
19 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
423
242
0
14 Mar 2024
Unlocking the conversion of Web Screenshots into HTML Code with the
  WebSight Dataset
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
Hugo Laurençon
Léo Tronchon
Victor Sanh
VLM
185
72
0
14 Mar 2024
MoAI: Mixture of All Intelligence for Large Language and Vision Models
MoAI: Mixture of All Intelligence for Large Language and Vision ModelsEuropean Conference on Computer Vision (ECCV), 2024
Byung-Kwan Lee
Beomchan Park
Chae Won Kim
Yonghyun Ro
MLLMVLM
327
31
0
12 Mar 2024
TextMonkey: An OCR-Free Large Multimodal Model for Understanding
  Document
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Yuliang Liu
Biao Yang
Qiang Liu
Zhang Li
Zhiyin Ma
Shuo Zhang
Xiang Bai
MLLMVLM
259
144
0
07 Mar 2024
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist
  Autonomous Agents for Desktop and Web
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Raghav Kapoor
Y. Butala
M. Russak
Jing Yu Koh
Kiran Kamble
Waseem Alshikh
Ruslan Salakhutdinov
LLMAG
427
100
0
27 Feb 2024
Hi-SAM: Marrying Segment Anything Model for Hierarchical Text
  Segmentation
Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation
Maoyuan Ye
Jing Zhang
Juhua Liu
Chenyu Liu
Baocai Yin
Cong Liu
Bo Du
Dacheng Tao
VLM
178
30
0
31 Jan 2024
MouSi: Poly-Visual-Expert Vision-Language Models
MouSi: Poly-Visual-Expert Vision-Language Models
Xiaoran Fan
Changzhi Sun
Changhao Jiang
Shuo Li
Senjie Jin
...
Tao Gui
Xipeng Qiu
Xuanjing Huang
Zuxuan Wu
Yunchun Jiang
VLM
120
24
0
30 Jan 2024
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document
  Understanding with Instructions
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with InstructionsAAAI Conference on Artificial Intelligence (AAAI), 2024
Ryota Tanaka
Taichi Iki
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
VLM
198
33
0
24 Jan 2024
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI AgentsAnnual Meeting of the Association for Computational Linguistics (ACL), 2024
Kanzhi Cheng
Qiushi Sun
Yougang Chu
Fangzhi Xu
Yantao Li
Jianbing Zhang
Zhiyong Wu
LLMAG
579
339
0
17 Jan 2024
DocLLM: A layout-aware generative language model for multimodal document
  understanding
DocLLM: A layout-aware generative language model for multimodal document understandingAnnual Meeting of the Association for Computational Linguistics (ACL), 2023
Dongsheng Wang
Natraj Raman
Mathieu Sibue
Zhiqiang Ma
Petr Babkin
Simerjot Kaur
Yulong Pei
Armineh Nourbakhsh
Xiaomo Liu
VLM
216
99
0
31 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic
  Visual-Linguistic Tasks
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLMMLLM
530
2,092
0
21 Dec 2023
Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors
Bridging Synthetic and Real Worlds for Pre-training Scene Text DetectorsEuropean Conference on Computer Vision (ECCV), 2023
Tongkun Guan
Wei Shen
Xuehang Yang
Xuehui Wang
Yunbo Wang
268
8
0
08 Dec 2023
DocPedia: Unleashing the Power of Large Multimodal Model in the
  Frequency Domain for Versatile Document Understanding
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
Hao Feng
Qi Liu
Hao Liu
Wen-gang Zhou
Houqiang Li
Can Huang
VLM
261
91
0
20 Nov 2023
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for
  Multi-modal Large Language Models
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Ziyi Lin
Chris Liu
Renrui Zhang
Shiyang Feng
Longtian Qiu
...
Siyuan Huang
Yichi Zhang
Xuming He
Jiaming Song
Yu Qiao
MLLMVLM
264
270
0
13 Nov 2023
SpectralGPT: Spectral Remote Sensing Foundation Model
SpectralGPT: Spectral Remote Sensing Foundation ModelIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Danfeng Hong
Bing Zhang
Xuyang Li
Yuxuan Li
Chenyu Li
...
Xiuping Jia
Antonio J. Plaza
Paolo Gamba
J. Benediktsson
J. Chanussot
227
615
0
13 Nov 2023
Monkey: Image Resolution and Text Label Are Important Things for Large
  Multi-modal Models
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal ModelsComputer Vision and Pattern Recognition (CVPR), 2023
Zhang Li
Biao Yang
Qiang Liu
Zhiyin Ma
Shuo Zhang
Jingxu Yang
Yabo Sun
Yuliang Liu
Xiang Bai
MLLM
401
370
0
11 Nov 2023
UReader: Universal OCR-free Visually-situated Language Understanding
  with Multimodal Large Language Model
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language ModelConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Jiabo Ye
Anwen Hu
Haiyang Xu
Qinghao Ye
Mingshi Yan
...
Ji Zhang
Qin Jin
Liang He
Xin Lin
Feiyan Huang
VLMMLLM
292
124
0
08 Oct 2023
Improved Baselines with Visual Instruction Tuning
Improved Baselines with Visual Instruction TuningComputer Vision and Pattern Recognition (CVPR), 2023
Haotian Liu
Chunyuan Li
Yuheng Li
Yong Jae Lee
VLMMLLM
548
3,997
0
05 Oct 2023
Kosmos-2.5: A Multimodal Literate Model
Kosmos-2.5: A Multimodal Literate Model
Tengchao Lv
Yupan Huang
Jingye Chen
Lei Cui
Shuming Ma
...
Weiyao Luo
Shaoxiang Wu
Guoxin Wang
Cha Zhang
Furu Wei
VLMMLLM
228
83
0
20 Sep 2023
UniDoc: A Universal Large Multimodal Model for Simultaneous Text
  Detection, Recognition, Spotting and Understanding
UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding
Hao Feng
Zijian Wang
Jingqun Tang
Jinghui Lu
Wen-gang Zhou
Houqiang Li
Can Huang
MLLMVLM
307
57
0
19 Aug 2023
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image
  Understanding
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Yanzhe Zhang
Ruiyi Zhang
Jiuxiang Gu
Jiuxiang Gu
Nedim Lipka
Diyi Yang
Tongfei Sun
VLMMLLM
240
279
0
29 Jun 2023
Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
Chaoning Zhang
Dongshen Han
Yu Qiao
Jung Uk Kim
Sung-Ho Bae
Seungkyu Lee
Choong Seon Hong
VLM
332
482
0
25 Jun 2023
Mind2Web: Towards a Generalist Agent for the Web
Mind2Web: Towards a Generalist Agent for the WebNeural Information Processing Systems (NeurIPS), 2023
Xiang Deng
Yu Gu
Boyuan Zheng
Shijie Chen
Samuel Stevens
Boshi Wang
Huan Sun
Yu-Chuan Su
LLMAG
420
741
0
09 Jun 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Judging LLM-as-a-Judge with MT-Bench and Chatbot ArenaNeural Information Processing Systems (NeurIPS), 2023
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALMOSLMELM
2.3K
6,293
0
09 Jun 2023
PaLI-X: On Scaling up a Multilingual Vision and Language Model
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
...
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
VLM
310
246
0
29 May 2023
CCpdf: Building a High Quality Corpus for Visually Rich Documents from
  Web Crawl Data
CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl DataIEEE International Conference on Document Analysis and Recognition (ICDAR), 2023
M. Turski
Tomasz Stanislawek
Karol Kaczmarek
Pawel Dyda
Filip Graliñski
228
15
0
28 Apr 2023
12
Next