2007.00398
Cited By
v1
v2
v3 (latest)
DocVQA: A Dataset for VQA on Document Images
1 July 2020
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
Papers citing
"DocVQA: A Dataset for VQA on Document Images"
50 / 759 papers shown
Interpret, prune and distill Donut : towards lightweight VLMs for VQA on document
Adnan Ben Mansour
Ayoub Karine
D. Naccache
138
0
0
30 Sep 2025
Defeating Cerberus: Concept-Guided Privacy-Leakage Mitigation in Multimodal Language Models
Boyang Zhang
Istemi Ekin Akkus
Ruichuan Chen
Alice Dethise
Klaus Satzke
Ivica Rimac
Yang Zhang
PILM
195
0
0
29 Sep 2025
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Chenyue Zhou
Mingxuan Wang
Yanbiao Ma
Chenxu Wu
Wanyi Chen
...
Guoli Jia
Lingling Li
Z. Lu
Y. Lu
Wenhan Luo
LRM
455
10
0
29 Sep 2025
OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding
Jiancong Xie
Wenjin Wang
Zhuomeng Zhang
Zihan Liu
Qi Liu
Ke Feng
Zixun Sun
Yuedong Yang
VLM
90
0
0
29 Sep 2025
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
Xiang An
Yin Xie
Kaicheng Yang
Wenkang Zhang
X. Zhao
...
Ziyong Feng
Ziwei Liu
Bo Li
Jiankang Deng
MLLM
VLM
SyDa
372
46
0
28 Sep 2025
Visual CoT Makes VLMs Smarter but More Fragile
Chunxue Xu
Yiwei Wang
Yujun Cai
Bryan Hooi
Songze Li
MLLM
LRM
150
0
0
28 Sep 2025
RIV: Recursive Introspection Mask Diffusion Vision Language Model
YuQian Li
Limeng Qiao
Lin Ma
VLM
85
1
0
28 Sep 2025
LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models
Shubhang Bhatnagar
Andy Xu
Kar-Han Tan
Narendra Ahuja
MQ
215
0
0
28 Sep 2025
Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models
Junjie Li
Ziao Wang
Jianghong Ma
Xiaofeng Zhang
127
0
0
27 Sep 2025
SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction
Yihao Ding
Soyeon Caren Han
Yanbei Jiang
Yan Li
Zechuan Li
Yifan Peng
SyDa
105
0
0
27 Sep 2025
Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding
Ziheng Chi
Yifan Hou
Chenxi Pang
Shaobo Cui
Mubashara Akhtar
Mrinmaya Sachan
143
0
0
26 Sep 2025
Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models
Michael Jungo
Andreas Fischer
87
0
0
26 Sep 2025
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Sicong Leng
Jing Wang
Jiaxi Li
Hao Zhang
Zhiqiang Hu
...
Deli Zhao
Wei Lu
Yu Rong
Aixin Sun
Shijian Lu
OffRL
LRM
130
11
0
25 Sep 2025
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
Tianrun Xu
Haoda Jing
Y. Li
Yuquan Wei
Jun Feng
Guanyu Chen
Haichuan Gao
Tianren Zhang
Feng Chen
OffRL
102
0
0
25 Sep 2025
TABLET: A Large-Scale Dataset for Robust Visual Table Understanding
Iñigo Alonso
Imanol Miranda
Eneko Agirre
Mirella Lapata
LMTD
VLM
395
0
0
25 Sep 2025
CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition
Sina J. Semnani
Han Zhang
Xinyan He
Merve Tekgürler
Monica S. Lam
3DV
137
0
0
24 Sep 2025
A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA
Belal Shoer
Yova Kementchedjhieva
50
0
0
24 Sep 2025
Rule Encoding and Compliance in Large Language Models: An Information-Theoretic Analysis
Joachim Diederich
204
0
0
23 Sep 2025
Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards
Honghao Chen
Xingzhou Lou
Xiaokun Feng
Kaiqi Huang
Xinlong Wang
OffRL
LRM
191
1
0
23 Sep 2025
Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
Yuheng Shi
Xiaohuan Pei
Minjing Dong
Chang Xu
ObjD
269
0
0
21 Sep 2025
Qianfan-VL: Domain-Enhanced Universal Vision-Language Models
Daxiang Dong
Mingming Zheng
Dong Xu
Bairong Zhuang
W. Zhang
...
Ruchang Yao
Ziye Yuan
J. Wu
Guangjun Xie
Dou Shen
VLM
99
1
0
19 Sep 2025
Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models
Renjie Pi
Kehao Miao
Li Peihang
Runtao Liu
Lei Li
Jipeng Zhang
Xiaofang Zhou
LRM
105
2
0
19 Sep 2025
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Yanghao Li
Rui Qian
Bowen Pan
Haotian Zhang
Haoshuo Huang
...
Zhengdong Zhang
Chen Chen
Yang Zhao
Ruoming Pang
Zhifeng Chen
MLLM
205
4
0
19 Sep 2025
Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration
Yuanchen Wu
Ke Yan
Shouhong Ding
Ziyin Zhou
Xiaoqiang Li
LRM
111
1
0
17 Sep 2025
SAIL-VL2 Technical Report
Weijie Yin
Yongjie Ye
Fangxun Shu
Yue Liao
Zijian Kang
...
Han Wang
Wenzhuo Liu
Xiao Liang
Shuicheng Yan
Chao Feng
LRM
VLM
301
4
0
17 Sep 2025
AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models
Heng Zhang
Haichuan Hu
Yaomin Shen
Weihao Yu
Yilei Yuan
...
Zijian Zhang
Lubin Gan
Huihui Wei
Hao Zhang
Jin Huang
MoE
354
0
0
16 Sep 2025
3D Aware Region Prompted Vision Language Model
A. Cheng
Yang Fu
Yukang Chen
Zhijian Liu
X. Li
...
Jan Kautz
Pavlo Molchanov
Hongxu Yin
Xiaolong Wang
Sifei Liu
139
8
0
16 Sep 2025
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Tianyu Yu
Zefan Wang
Chongyi Wang
Fuwei Huang
Wenshuo Ma
...
Ning Ding
Xu Han
Xingtai Lv
Zhiyuan Liu
Maosong Sun
MLLM
VLM
209
25
0
16 Sep 2025
MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs
Feilong Chen
Y. Liu
Yi Huang
Hao Wang
Miren Tian
Ya-Qi Yu
Minghui Liao
Jihao Wu
MLLM
VLM
333
1
0
15 Sep 2025
PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models
Wanru Zhuang
Wenbo Li
Zhibin Lan
Xu Han
P. Li
Jinsong Su
VLM
142
0
0
14 Sep 2025
Towards Reliable and Interpretable Document Question Answering via VLMs
Alessio Chen
Simone Giovannini
Andrea Gemelli
Fabio Coppini
S. Marinai
193
0
0
12 Sep 2025
VARCO-VISION-2.0 Technical Report
Young-rok Cha
Jeongho Ju
SunYoung Park
Jong-Hyeon Lee
Younghyun Yu
Youngjune Kim
VLM
219
2
0
12 Sep 2025
Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
Haiyang Yu
Y. Wu
Fan Shi
Lei Liao
Jinghui Lu
...
Liyuan Meng
Chao Feng
Can Huang
Jingqun Tang
Bin Li
VLM
95
4
0
10 Sep 2025
Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Boammani Aser Lompo
Marc Haraoui
LMTD
ReLM
VLM
LRM
132
1
0
09 Sep 2025
Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models
Jaemin Son
Sujin Choi
Inyong Yun
VLM
98
0
0
08 Sep 2025
MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval
Xixi Wu
Yanchao Tan
Nan Hou
Ruiyang Zhang
Hong Cheng
139
4
0
06 Sep 2025
OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
Han Li
Xinyu Peng
Y. Wang
Zelin Peng
Xin Chen
Rongxiang Weng
Jingang Wang
Xunliang Cai
Wenrui Dai
Hongkai Xiong
MLLM
OffRL
366
13
0
03 Sep 2025
VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality
Srihari Bandraupalli
Anupam Purwar
VLM
72
1
0
03 Sep 2025
A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation
Kesen Wang
Daulet Toibazar
Pedro J. Moreno
82
0
0
02 Sep 2025
MoPEQ: Mixture of Mixed Precision Quantized Experts
Krishna Teja Chitty-Venkata
Jie Ye
M. Emani
MoE
MQ
98
2
0
02 Sep 2025
CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models
Joao Valente
Atabak Dehban
Rodrigo Ventura
107
1
0
29 Aug 2025
SUMMA: A Multimodal Large Language Model for Advertisement Summarization
Weitao Jia
Shuo Yin
Zhoufutu Wen
Han Wang
Zehui Dai
Kun Zhang
Zhenyu Li
Tao Zeng
Xiaohui Lv
135
0
0
28 Aug 2025
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Qi Yang
Bolin Ni
Shiming Xiang
Han Hu
Houwen Peng
Jie Jiang
LRM
208
5
0
28 Aug 2025
KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
Taebaek Hwang
Minseo Kim
Gisang Lee
Seonuk Kim
Hyunjun Eun
VLM
161
1
0
27 Aug 2025
Extracting Information from Scientific Literature via Visual Table Question Answering Models
Dongyoun Kim
Hyung-do Choi
Youngsun Jang
John Kim
LMTD
94
0
0
26 Aug 2025
Enhancing Document VQA Models via Retrieval-Augmented Generation
Eric López
Artemis LLabres
Ernest Valveny
RALM
222
1
0
26 Aug 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang
Zhangwei Gao
Lixin Gu
Hengjun Pu
Long Cui
...
Bowen Zhou
Kai Chen
Yu Qiao
Wenhai Wang
Gen Luo
MLLM
LRM
305
298
0
25 Aug 2025
Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Somraj Gautam
Abhirama Subramanyam Penamakuri
Abhishek Bhandari
Gaurav Harit
LMTD
LRM
267
2
0
24 Aug 2025
MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models
Krishna Teja Chitty-Venkata
Sylvia Howland
Golara Azar
Daria Soboleva
Natalia Vassilieva
Siddhisanket Raskar
M. Emani
V. Vishwanath
MoE
121
1
0
24 Aug 2025
Explain Before You Answer: A Survey on Compositional Visual Reasoning
Fucai Ke
Joy Hsu
Zhixi Cai
Zixian Ma
Xin Zheng
...
P. D. Haghighi
Gholamreza Haffari
Ranjay Krishna
Jiajun Wu
H. Rezatofighi
ReLM
CoGe
LRM
364
10
0
24 Aug 2025