2505.11015
Cited By
v1
v2 (latest)
WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
16 May 2025
An-Lan Wang
Jingqun Tang
Liao Lei
Hao Feng
Qi Liu
Xiang Fei
Jinghui Lu
Han Wang
Wen Liu
Hao Liu
Wenshu Fan
Xiang Bai
Can Huang
Papers citing
"WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?"
32 / 32 papers shown
DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation
Yongkun Du
Pinxuan Chen
Xuye Ying
Z. Chen
48
0
0
23 Nov 2025
Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning
Jinghui Lu
Haiyang Yu
Siliang Xu
Shiwei Ran
Guozhi Tang
...
Teng Fu
Hao Feng
Jingqun Tang
Hongru Wang
Can Huang
LRM
325
12
0
21 May 2025
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hao Feng
Shu Wei
Xiang Fei
Wei Shi
Yingdong Han
...
Qi Liu
Chunhui Lin
Jingqun Tang
Hao Liu
Can Huang
251
15
0
20 May 2025
Advancing Sequential Numerical Prediction in Autoregressive Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xiang Fei
Jinghui Lu
Qi Sun
Hao Feng
Yanjie Wang
Wei Shi
An-Lan Wang
Jingqun Tang
Can Huang
AI4TS
483
4
0
19 May 2025
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen
Zhiyu Wu
Xingchao Liu
Zizheng Pan
Wen Liu
Zhenda Xie
X. Yu
Chong Ruan
AI4TS
393
413
0
29 Jan 2025
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
Ling Fu
Biao Yang
Zhebin Kuang
Jiajun Song
Yuzhe Li
...
Jingqun Tang
Wei Chen
Lianwen Jin
Yunxing Liu
Xiang Bai
279
58
0
31 Dec 2024
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
Bin Shan
Xiang Fei
Wei Shi
An-Lan Wang
Guozhi Tang
Lei Liao
Jingqun Tang
Xiang Bai
Can Huang
VLM
165
7
0
15 Oct 2024
ParGo: Bridging Vision-Language with Partial and Global Views
AAAI Conference on Artificial Intelligence (AAAI), 2025
An-Lan Wang
Bin Shan
Wei Shi
Kun-Yu Lin
Xiang Fei
Guozhi Tang
Lei Liao
Jingqun Tang
Can Huang
Wei-Shi Zheng
MLLM
VLM
423
21
0
23 Aug 2024
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLM
SyDa
VLM
367
1,631
0
06 Aug 2024
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao
Tianyu Yu
Ao Zhang
Chongyi Wang
Junbo Cui
...
Xu Han
Guoyang Zeng
Dahai Li
Zhiyuan Liu
Maosong Sun
VLM
MLLM
341
830
0
03 Aug 2024
Harmonizing Visual Text Comprehension and Generation
Zhen Zhao
Jingqun Tang
Binghong Wu
Chunhui Lin
Shubo Wei
Hao Liu
Xin Tan
Zhizhong Zhang
Can Huang
Yuan Xie
VLM
240
35
0
23 Jul 2024
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Haodong Duan
Xinyu Fang
Junming Yang
Xiangyu Zhao
Lin Chen
...
Yuhang Zang
Pan Zhang
Jiaqi Wang
Dahua Lin
Kai Chen
LM&MA
VLM
598
332
0
16 Jul 2024
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
Jinghui Lu
Haiyang Yu
Yanjie Wang
Yongjie Ye
Jingqun Tang
...
Qi Liu
Hao Feng
Han Wang
Hao Liu
Can Huang
503
32
0
02 Jul 2024
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Team GLM
Aohan Zeng
Bin Xu
Bowen Wang
...
Zhaoyu Wang
Zhen Yang
Zhengxiao Du
Zhenyu Hou
Zihan Wang
ALM
286
1,109
0
18 Jun 2024
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
Weichao Zhao
Hao Feng
Qi Liu
Jingqun Tang
Shubo Wei
...
Lei Liao
Yongjie Ye
Hao Liu
Houqiang Li
Can Huang
LMTD
229
45
0
03 Jun 2024
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
Jingqun Tang
Qi-dong Liu
Yongjie Ye
Jinghui Lu
Shubo Wei
...
Hao Liu
Xiang Bai
Can Huang
654
46
0
20 May 2024
TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains
Yoonsik Kim
Moonbin Yim
Ka Yeon Song
LMTD
272
40
0
30 Apr 2024
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin
Sam Ade Jacobs
A. A. Awan
J. Aneja
Ahmed Hassan Awadallah
...
Li Zhang
Yi Zhang
Yue Zhang
Yunan Zhang
Xiren Zhou
LRM
ALM
465
1,811
0
22 Apr 2024
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Jingqun Tang
Chunhui Lin
Zhen Zhao
Shubo Wei
Binghong Wu
...
Yuliang Liu
Xiang Bai
Can Huang
LRM
VLM
MLLM
376
41
0
19 Apr 2024
HRVDA: High-Resolution Visual Document Assistant
Computer Vision and Pattern Recognition (CVPR), 2024
Chaohu Liu
Kun Yin
Haoyu Cao
Xinghua Jiang
Xin Li
Yinsong Liu
Deqiang Jiang
Xing Sun
Linli Xu
VLM
227
30
0
10 Apr 2024
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Neural Information Processing Systems (NeurIPS), 2024
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Bin Wang
...
Xingcheng Zhang
Jifeng Dai
Yuxin Qiao
Dahua Lin
Yuan Liu
VLM
MLLM
208
159
0
09 Apr 2024
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Yuliang Liu
Biao Yang
Qiang Liu
Zhang Li
Zhiyin Ma
Shuo Zhang
Xiang Bai
MLLM
VLM
251
144
0
07 Mar 2024
Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer
Computer Vision and Pattern Recognition (CVPR), 2024
Zhen Zhao
Jingqun Tang
Chunhui Lin
Binghong Wu
Can Huang
Hao Liu
Xin Tan
Zhizhong Zhang
Yuan Xie
352
37
0
22 Nov 2023
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
Hao Feng
Qi Liu
Hao Liu
Wen-gang Zhou
Houqiang Li
Can Huang
VLM
253
89
0
20 Nov 2023
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Computer Vision and Pattern Recognition (CVPR), 2024
Zhang Li
Biao Yang
Qiang Liu
Zhiyin Ma
Shuo Zhang
Jingxu Yang
Yabo Sun
Yuliang Liu
Xiang Bai
MLLM
401
367
0
11 Nov 2023
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Jiabo Ye
Anwen Hu
Haiyang Xu
Qinghao Ye
Mingshi Yan
...
Chenliang Li
Junfeng Tian
Qiang Qi
Ji Zhang
Feiyan Huang
VLM
MLLM
169
154
0
04 Jul 2023
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Yanzhe Zhang
Ruiyi Zhang
Jiuxiang Gu
Yufan Zhou
Nedim Lipka
Diyi Yang
Tong Sun
VLM
MLLM
232
277
0
29 Jun 2023
Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective
Computer Vision and Pattern Recognition (CVPR), 2023
Weixia Zhang
Guangtao Zhai
Ying Wei
Yunbo Wang
Kede Ma
VLM
160
274
0
27 Mar 2023
GPT-4 Technical Report
OpenAI
Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAG
MLLM
3.3K
20,007
0
15 Mar 2023
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Findings of the Association for Computational Linguistics (ACL), 2022
Ahmed Masry
Do Xuan Long
J. Tan
Shafiq Joty
Enamul Hoque
AIMat
375
1,061
0
19 Mar 2022
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
546
1,048
0
01 Jul 2020
A Diagram Is Worth A Dozen Images
Aniruddha Kembhavi
M. Salvato
Eric Kolve
Minjoon Seo
Hannaneh Hajishirzi
Ali Farhadi
3DV
204
729
0
24 Mar 2016