2007.00398
DocVQA: A Dataset for VQA on Document Images
1 July 2020
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
Papers citing
"DocVQA: A Dataset for VQA on Document Images"
50 / 755 papers shown
Title
RegionRAG: Region-level Retrieval-Augmented Generation for Visual Document Understanding
Yinglu Li
Zhiying Lu
Zhihang Liu
Chuanbin Liu
Hongtao Xie
VLM
218
1
0
31 Oct 2025
LongCat-Flash-Omni Technical Report
M-A-P Team
Bairui Wang
Bayan
Bin Xiao
Bo Zhang
...
Xin Pan
Xin Chen
Xiusong Sun
Xu Xiang
X. Xing
MLLM
VLM
510
2
0
31 Oct 2025
SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Yiqiao Jin
Rachneet Kaur
Zhen Zeng
Sumitra Ganesh
Srijan Kumar
156
0
0
30 Oct 2025
ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents
Tianyu Yang
Terry Ruas
Yijun Tian
Jan Philip Wahle
Daniel Kurzawe
Bela Gipp
VLM
262
0
0
29 Oct 2025
SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs
Jinhong Deng
Wen Li
Joey Tianyi Zhou
Yang He
LRM
96
0
0
28 Oct 2025
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
Inclusion AI
Bowen Ma
Cheng Zou
C. Yan
Chunxiang Jin
...
Zhiqiang Fang
Zhihao Qiu
Ziyuan Huang
Zizheng Yang
Z. He
MLLM
MoE
VLM
294
2
0
28 Oct 2025
Revisiting Multimodal Positional Encoding in Vision-Language Models
Jie Huang
Xuejing Liu
Sibo Song
Ruibing Hou
Hong Chang
Junyang Lin
S. Bai
148
1
0
27 Oct 2025
KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution
Junzhe Zhang
Huixuan Zhang
Xiaojun Wan
53
0
0
24 Oct 2025
Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
Xuyang Liu
Xiyan Gui
Y. Zhang
Linfeng Zhang
VLM
100
2
0
23 Oct 2025
GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs
Guanghao Zheng
Bowen Shi
Mingxing Xu
Ruoyu Sun
Peisen Zhao
...
Wenrui Dai
Junni Zou
Hongkai Xiong
Xiaopeng Zhang
Qi Tian
VLM
147
0
0
23 Oct 2025
SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
Gyubeum Lim
Yemo Koo
Vijay Krishna Madisetti
92
0
0
22 Oct 2025
Unified Reinforcement and Imitation Learning for Vision-Language Models
Byung-Kwan Lee
Ryo Hachiuma
Yong Man Ro
Yu-Chun Wang
Yueh-Hua Wu
VLM
156
1
0
22 Oct 2025
CARES: Context-Aware Resolution Selector for VLMs
Moshe Kimhi
Nimrod Shabtay
Raja Giryes
Chaim Baskin
Eli Schwartz
VLM
116
0
0
22 Oct 2025
UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models
Chen Chen
ZeYang Hu
Fengjiao Chen
Liya Ma
Jiaxing Liu
Xiaoyu Li
Xuezhi Cao
Xunliang Cai
166
0
0
21 Oct 2025
Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
Zhining Liu
Ziyi Chen
Hui Liu
Chen Luo
Xianfeng Tang
...
Zhenwei Dai
Zhan Shi
Tianxin Wei
Benoit Dumoulin
Hanghang Tong
LRM
122
1
0
20 Oct 2025
SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
Samir Khaki
Junxian Guo
Jiaming Tang
Shang Yang
Yukang Chen
Konstantinos N. Plataniotis
Yao Lu
Song Han
Zhijian Liu
MLLM
VLM
161
1
0
20 Oct 2025
FineVision: Open Data Is All You Need
Luis Wiedmann
Orr Zohar
Amir Mahla
Xiaohan Wang
Rui Li
Thibaud Frere
Leandro von Werra
Aritra Roy Gosthipaty
Andrés Marafioti
VLM
180
12
0
20 Oct 2025
RL makes MLLMs see better than SFT
Junha Song
Sangdoo Yun
Dongyoon Han
Jaegul Choo
Byeongho Heo
OffRL
179
0
0
18 Oct 2025
PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
Lukas Selch
Yufang Hou
Muhammad Jehanzeb Mirza
Sivan Doveh
James Glass
Rogerio Feris
Wei Lin
202
0
0
18 Oct 2025
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
Jiaying Zhu
Yurui Zhu
Xin Lu
Wenrui Yan
Dong Li
Kunlin Liu
Xueyang Fu
Zheng-Jun Zha
MQ
VLM
227
0
0
18 Oct 2025
Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
Sensen Gao
Shanshan Zhao
Xu Jiang
Lunhao Duan
Yong Xien Chng
Qing-Guo Chen
Weihua Luo
Kaifu Zhang
Jia-Wang Bian
Mingming Gong
230
0
0
17 Oct 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Hanrong Ye
Chao-Han Huck Yang
Arushi Goel
Wei Huang
Ligeng Zhu
...
Andrew Tao
Song Han
Jan Kautz
Hongxu Yin
Pavlo Molchanov
174
3
0
17 Oct 2025
Vision-Centric Activation and Coordination for Multimodal Large Language Models
Yunnan Wang
Fan Lu
Kecheng Zheng
Ziyuan Huang
Ziqiang Li
Wenjun Zeng
Xin Jin
MLLM
328
0
0
16 Oct 2025
Document Intelligence in the Era of Large Language Models: A Survey
Weishi Wang
Hengchang Hu
Zhijie Zhang
Zhaochen Li
Hongxin Shao
Daniel Dahlmeier
AI4TS
168
0
0
15 Oct 2025
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
Wenwen Tong
Hewei Guo
Dongchuan Ran
Jiangnan Chen
Jiefan Lu
...
Dinghao Zhou
Guiping Zhong
Ken Zheng
Shiyin Kang
Lewei Lu
MLLM
AuLLM
VGen
VLM
400
4
0
15 Oct 2025
ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
Long Cui
Weiyun Wang
Jie Shao
Zichen Wen
Gen Luo
Linfeng Zhang
Y. Zhang
Yu Qiao
Wenhai Wang
VLM
160
2
0
14 Oct 2025
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
Kartik Narayan
Yang Xu
Tian Cao
Kavya Nerella
Vishal M. Patel
Navid Shiee
Peter Grasch
Chao Jia
Yinfei Yang
Zhe Gan
ObjD
KELM
VLM
232
3
0
14 Oct 2025
MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
Zhenxin Lei
Zhangwei Gao
Changyao Tian
Erfei Cui
Guanzhou Chen
...
Xiangyu Zhao
Jiayi Ji
Yu Qiao
Wenhai Wang
Gen Luo
VLM
237
0
0
14 Oct 2025
FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models
Shengming Yuan
Xinyu Lyu
Shuailong Wang
Beitao Chen
Jingkuan Song
Lianli Gao
LRM
240
0
0
13 Oct 2025
Scaling Language-Centric Omnimodal Representation Learning
Chenghao Xiao
Hou Pong Chan
Hao Zhang
Weiwen Xu
Mahani Aljunied
Yu Rong
128
0
0
13 Oct 2025
Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li
Chaolei Tan
Haoxuan Chen
Jianxin Ma
Jian-Fang Hu
Wei-Shi Zheng
Jianhuang Lai
VLM
133
1
0
12 Oct 2025
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
Yuqi Liu
Liangyu Chen
Jiazhen Liu
Mingkang Zhu
Zhisheng Zhong
Bei Yu
Jiaya Jia
LRM
140
0
0
12 Oct 2025
Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task
Zilong Wang
Xiaoyu Shen
48
0
0
11 Oct 2025
Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping
Dwip Dalal
Gautam Vashishtha
Utkarsh Mishra
Jeonghwan Kim
Madhav Kanda
Hyeonjeong Ha
Svetlana Lazebnik
Mengyue Yang
Unnat Jain
213
2
0
10 Oct 2025
CapGeo: A Caption-Assisted Approach to Geometric Reasoning
Y. Li
Siyi Qian
Hao Liang
Leqi Zheng
Ruichuan An
Yongzhen Guo
Wentao Zhang
ReLM
LRM
100
0
0
10 Oct 2025
How to Teach Large Multimodal Models New Skills
Zhen Zhu
Yiming Gong
Yao Xiao
Yaoyao Liu
Derek Hoiem
MLLM
CLL
KELM
165
0
0
09 Oct 2025
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
Changyao Tian
Hao Li
Gen Luo
Xizhou Zhu
Weijie Su
...
Y. Liu
Lewei Lu
Wenhai Wang
Hongsheng Li
Jifeng Dai
113
1
0
09 Oct 2025
Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
Chenfei Liao
Wensong Wang
Zichen Wen
Xu Zheng
Y. Wang
...
Xin Zou
Yuqian Fu
Bin Ren
Linfeng Zhang
Xuming Hu
93
1
0
08 Oct 2025
LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding
Zhivar Sourati
Zheng Wang
Marianne Menglin Liu
Yazhe Hu
Mengqing Guo
...
Kyu Han
Tao Sheng
Sujith Ravi
Morteza Dehghani
Dan Roth
80
0
0
08 Oct 2025
Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
Omri Uzan
Asaf Yehudai
Roi Pony
Eyal Shnarch
Ariel Gera
178
1
0
06 Oct 2025
Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting
Xuyang Guo
Zekai Huang
Zhenmei Shi
Zhao Song
Jiahao Zhang
CoGe
VLM
262
4
0
06 Oct 2025
Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
Radha Gulhane
Sathish Reddy Indurthi
OffRL
LRM
56
0
0
06 Oct 2025
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
Xiangyu Peng
Can Qin
Zeyuan Chen
Ran Xu
Caiming Xiong
Chien-Sheng Wu
VLM
174
0
0
04 Oct 2025
Exploring OCR-augmented Generation for Bilingual VQA
JoonHo Lee
Sunho Park
VLM
100
0
0
02 Oct 2025
ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models
Krishna Teja Chitty-Venkata
M. Emani
MLLM
VGen
LRM
VLM
177
1
0
02 Oct 2025
Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs
Sanghwan Kim
Rui Xiao
Stephan Alaniz
Yongqin Xian
Zeynep Akata
108
0
0
01 Oct 2025
Interpret, prune and distill Donut : towards lightweight VLMs for VQA on document
Adnan Ben Mansour
Ayoub Karine
D. Naccache
112
0
0
30 Sep 2025
Defeating Cerberus: Concept-Guided Privacy-Leakage Mitigation in Multimodal Language Models
Boyang Zhang
Istemi Ekin Akkus
Ruichuan Chen
Alice Dethise
Klaus Satzke
Ivica Rimac
Yang Zhang
PILM
146
0
0
29 Sep 2025
OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding
Jiancong Xie
Wenjin Wang
Zhuomeng Zhang
Zihan Liu
Qi Liu
Ke Feng
Zixun Sun
Yuedong Yang
VLM
69
0
0
29 Sep 2025
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Chenyue Zhou
Mingxuan Wang
Yanbiao Ma
Chenxu Wu
Wanyi Chen
...
Guoli Jia
Lingling Li
Z. Lu
Y. Lu
Wenhan Luo
LRM
407
9
0
29 Sep 2025