2007.00398
DocVQA: A Dataset for VQA on Document Images
1 July 2020
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
Papers citing
"DocVQA: A Dataset for VQA on Document Images"
50 / 755 papers shown
Title
Reinforcement Learning for Large Model: A Survey
Weijia Wu
Chen Gao
Joya Chen
Kevin Lin
Qingwei Meng
Yiming Zhang
Yuke Qiu
Hong Zhou
Mike Zheng Shou
273
2
0
24 Dec 2025
Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Zirun Guo
Minjie Hong
Feng Zhang
Kai Jia
Tao Jin
OffRL
LRM
VLM
132
0
0
03 Dec 2025
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
Shojiro Yamabe
Futa Waseda
Daiki Shiono
Tsubasa Takahashi
DiffM
MLLM
VLM
173
0
0
03 Dec 2025
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Zichuan Lin
Y. Liu
Yang Yang
Lvfang Tao
Deheng Ye
VLM
63
0
0
03 Dec 2025
VACoT: Rethinking Visual Data Augmentation with VLMs
Zhengzhuo Xu
Chong Sun
Sinan Du
Chen Li
Jing Lyu
Chun Yuan
48
0
0
02 Dec 2025
MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm
Wei Chen
Chaoqun Du
Feng Gu
Wei He
Qizhen Li
...
Pengfei Yu
Y. Zheng
Chunpeng Zhou
Pan Zhou
Xuhan Zhu
MLLM
OffRL
VLM
589
1
0
02 Dec 2025
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Le Thien Phuc Nguyen
Zhuoran Yu
Samuel Low Yu Hang
Subin An
J. Lee
...
SeungEun Chung
Thanh-Huy Nguyen
JuWan Maeng
Soochahn Lee
Yong Jae Lee
AuLLM
VLM
173
0
0
01 Dec 2025
Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
Keliang Liu
Zizhi Chen
Mingcheng Li
Jingqun Tang
Dingkang Yang
Lihua Zhang
RALM
84
0
0
28 Nov 2025
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
Zhen Fang
Zhuoyang Liu
Jiaming Liu
Hao Chen
Y. Zeng
Shiting Huang
Zehui Chen
L. Chen
Shanghang Zhang
Feng Zhao
LRM
76
1
0
27 Nov 2025
From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
Yiming Chen
Junlin Han
Tianyi Bai
Shengbang Tong
Filippos Kokkinos
Philip Torr
36
0
0
27 Nov 2025
DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA
Ahmad Mohammadshirazi
Pinaki Prasad Guha Neogi
Dheeraj Kulshrestha
R. Ramnath
VGen
80
0
0
27 Nov 2025
CaptionQA: Is Your Caption as Useful as the Image Itself?
Shijia Yang
Yunong Liu
Bohan Zhai
Ximeng Sun
Zicheng Liu
E. Barsoum
Manling Li
Chenfeng Xu
CoGe
167
0
0
26 Nov 2025
EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens
Ze Feng
Sen Yang
Boqiang Duan
Wankou Yang
Jingdong Wang
VLM
149
0
0
26 Nov 2025
Qwen3-VL Technical Report
Shuai Bai
Yuxuan Cai
Ruizhe Chen
Keqin Chen
Xionghui Chen
...
Jingren Zhou
F. I. S. Kevin Zhou
J. Zhou
Yuanzhi Zhu
Ke Zhu
VLM
1.1K
39
0
26 Nov 2025
Text-Guided Semantic Image Encoder
Raghuveer Thirukovalluru
Xiaochuang Han
Bhuwan Dhingra
Emily Dinan
Maha Elbayad
VLM
136
0
0
25 Nov 2025
HKRAG: Holistic Knowledge Retrieval-Augmented Generation Over Visually-Rich Documents
Anyang Tong
Xiang Niu
ZhiPing Liu
Chang Tian
Yanyan Wei
Zenglin Shi
Meng Wang
101
1
0
25 Nov 2025
CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
Miguel Carvalho
Helder Dias
Bruno Martins
VLM
188
0
0
25 Nov 2025
Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
Meng Lu
Ran Xu
Yi Fang
Wenxuan Zhang
Yue Yu
...
Guanghua Xiao
Hanrui Wang
Di Jin
W. Shi
Xuan Wang
LRM
116
0
0
24 Nov 2025
DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation
Yongkun Du
Pinxuan Chen
Xuye Ying
Z. Chen
116
0
0
23 Nov 2025
ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization
Ahmad Mohammadshirazi
Pinaki Prasad Guha Neogi
Dheeraj Kulshrestha
R. Ramnath
64
0
0
22 Nov 2025
MGA-VQA: Secure and Interpretable Graph-Augmented Visual Question Answering with Memory-Guided Protection Against Unauthorized Knowledge Use
Ahmad Mohammadshirazi
Pinaki Prasad Guha Neogi
Dheeraj Kulshrestha
R. Ramnath
92
0
0
22 Nov 2025
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Mark Endo
Serena Yeung-Levy
LRM
221
0
0
21 Nov 2025
VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning
Lingxiao Li
Y. Wang
Xinyan Gao
Chen Tang
Xiangyu Yue
Chenyu You
LRM
68
1
0
21 Nov 2025
SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation
Shrikant B. Kendre
Austin Xu
Honglu Zhou
Michael S Ryoo
Shafiq Joty
Juan Carlos Niebles
174
0
0
21 Nov 2025
Learning to Think Fast and Slow for Visual Language Models
Chenyu Lin
Cheng Chi
Jinlin Wu
Sharon Li
Kaiyang Zhou
ReLM
VLM
225
0
0
20 Nov 2025
Arctic-Extract Technical Report
Mateusz Chiliński
Julita Ołtusek
Wojciech Jaśkowski
VLM
116
0
0
20 Nov 2025
FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation
Yueru He
Xueqing Peng
Yupeng Cao
Yan Wang
Lingfei Qian
...
Mingquan Lin
Prayag Tiwari
Jimin Huang
Guojun Xiong
Sophia Ananiadou
247
0
0
19 Nov 2025
Evaluating Multimodal Large Language Models on Vertically Written Japanese Text
Keito Sasagawa
Shuhei Kurita
Daisuke Kawahara
68
0
0
19 Nov 2025
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Duo Li
Zuhao Yang
Xiaoqin Zhang
Ling Shao
Shijian Lu
VLM
137
1
0
19 Nov 2025
BBox DocVQA: A Large Scale Bounding Box Grounded Dataset for Enhancing Reasoning in Document Visual Question Answer
Wenhan Yu
Wang Chen
Guanqiang Qi
Weikang Li
Yang Li
Lei Sha
Deguo Xia
Jizhou Huang
93
1
0
19 Nov 2025
Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution
N Dinesh Reddy
Dylan Snyder
Lona Kiragu
Mirajul Mohin
Shahrear Bin Amin
Sudeep Pillai
75
0
0
18 Nov 2025
Attention Grounded Enhancement for Visual Document Retrieval
Wanqing Cui
Wei Huang
Yazhi Guo
Yibo Hu
Meiguang Jin
Junfeng Ma
Keping Bi
133
0
0
17 Nov 2025
RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning
Jingqi Xu
Jingxi Lu
Chenghao Li
Sreetama Sarkar
Souvik Kundu
Peter A. Beerel
VLM
164
0
0
16 Nov 2025
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Yunxin Li
Xinyu Chen
Shenyuan Jiang
Haoyuan Shi
Zhenyu Liu
...
Zhenran Xu
Yicheng Ma
Meishan Zhang
Baotian Hu
Min Zhang
MLLM
MoE
OSLM
VLM
563
1
0
16 Nov 2025
Simple Vision-Language Math Reasoning via Rendered Text
Matvey Skripkin
Elizaveta Goncharova
Andrey Kuznetsov
ReLM
LRM
VLM
316
0
0
12 Nov 2025
TabRAG: Tabular Document Retrieval via Structured Language Representations
Jacob Si
Mike Qu
Michelle Lee
Yingzhen Li
LMTD
3DGS
3DV
227
0
0
10 Nov 2025
SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents
Jaehoon Lee
Sohyun Kim
Wanggeun Park
Geon Lee
Seungkyung Kim
Minyoung Lee
130
0
0
07 Nov 2025
Visual Spatial Tuning
Rui Yang
Ziyu Zhu
Yanwei Li
Jingjia Huang
Shen Yan
...
Xiangtai Li
S. Li
Wenqian Wang
Yi Lin
Hengshuang Zhao
VLM
333
5
0
07 Nov 2025
IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
Ali Faraz
Akash
Shaharukh Khan
Raja Kolla
Akshat Patidar
Suranjan Goswami
Abhinav Ravi
Chandra Khatri
Shubham Agarwal
VLM
156
0
0
06 Nov 2025
ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Surapon Nonesung
Teetouch Jaknamon
Sirinya Chaiophat
Natapong Nitarach
Chanakan Wittayasakpan
Warit Sirichotedumrong
Adisai Na-Thalang
Kunat Pipatanakul
VLM
295
0
0
06 Nov 2025
Seeing Straight: Document Orientation Detection for Efficient OCR
Suranjan Goswami
Abhinav Ravi
Raja Kolla
Ali Faraz
Shaharukh Khan
Akash
Chandra Khatri
Shubham Agarwal
170
0
0
06 Nov 2025
NVIDIA Nemotron Nano V2 VL
Nvidia
Amala Sanjay Deshmukh
Kateryna Chumachenko
Tuomas Rintamaki
Matthieu Le
...
Krzysztof Pawelec
Michael Evans
Katherine Luna
Jie Lou
Erick Galinkin
VLM
288
2
0
06 Nov 2025
Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang
J. Yang
Pinzhi Huang
Ellis L Brown
Zihao Yang
...
Daohan Lu
Rob Fergus
Yann LeCun
Li Fei-Fei
Saining Xie
160
12
0
06 Nov 2025
What's in Common? Multimodal Models Hallucinate When Reasoning Across Scenes
Candace Ross
Florian Bordes
Adina Williams
Polina Kirichenko
Mark Ibrahim
VLM
ReLM
LRM
180
1
0
05 Nov 2025
CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
Jizheng Ma
Xiaofei Zhou
Yanlong Song
Han Yan
VLM
LRM
161
1
0
04 Nov 2025
Dynamic Routing Between Experts: A Data-Efficient Approach to Continual Learning in Vision-Language Models
Jay Mohta
Kenan E. Ak
Dimitrios Dimitriadis
Yan Xu
Mingwei Shen
CLL
VLM
266
0
0
03 Nov 2025
The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation
İbrahim Ethem Deveci
Duygu Ataman
ReLM
ALM
ELM
LRM
199
0
0
03 Nov 2025
ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval
Ahmed Masry
Megh Thakkar
Patrice Bechard
Sathwik Tejaswi Madhusudhan
Rabiul Awal
...
Srivatsava Daruru
Enamul Hoque
Spandana Gella
Torsten Scholak
Sai Rajeswar
VLM
188
0
0
02 Nov 2025
Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding
Haneen Al-Homoud
Asma A. Ibrahim
Murtadha Al-Jubran
Fahad Al-Otaibi
Yazeed Al-Harbi
Daulet Toibazar
Kesen Wang
Pedro J. Moreno
149
0
0
01 Nov 2025
From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Jianwen Sun
Fanrui Zhang
Yukang Feng
Chuanhao Li
Zizhen Li
Jiaxin Ai
Yifan Chang
Yu Dai
Kaipeng Zhang
89
0
0
31 Oct 2025