DocVQA: A Dataset for VQA on Document Images
arXiv:2007.00398 · 1 July 2020
Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

Papers citing "DocVQA: A Dataset for VQA on Document Images" (showing 50 of 755)
RegionRAG: Region-level Retrieval-Augmented Generation for Visual Document Understanding
  Yinglu Li, Zhiying Lu, Zhihang Liu, Chuanbin Liu, Hongtao Xie · [VLM] · 31 Oct 2025

LongCat-Flash-Omni Technical Report
  M-A-P Team, Bairui Wang, Bayan, Bin Xiao, Bo Zhang, ..., Xin Pan, Xin Chen, Xiusong Sun, Xu Xiang, X. Xing · [MLLM, VLM] · 31 Oct 2025

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
  Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar · 30 Oct 2025

ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents
  Tianyu Yang, Terry Ruas, Yijun Tian, Jan Philip Wahle, Daniel Kurzawe, Bela Gipp · [VLM] · 29 Oct 2025

SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs
  Jinhong Deng, Wen Li, Joey Tianyi Zhou, Yang He · [LRM] · 28 Oct 2025
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
  Inclusion AI, Bowen Ma, Cheng Zou, C. Yan, Chunxiang Jin, ..., Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Z. He · [MLLM, MoE, VLM] · 28 Oct 2025

Revisiting Multimodal Positional Encoding in Vision-Language Models
  Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, S. Bai · 27 Oct 2025

KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution
  Junzhe Zhang, Huixuan Zhang, Xiaojun Wan · 24 Oct 2025

Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
  Xuyang Liu, Xiyan Gui, Y. Zhang, Linfeng Zhang · [VLM] · 23 Oct 2025

GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs
  Guanghao Zheng, Bowen Shi, Mingxing Xu, Ruoyu Sun, Peisen Zhao, ..., Wenrui Dai, Junni Zou, Hongkai Xiong, Xiaopeng Zhang, Qi Tian · [VLM] · 23 Oct 2025
SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
  Gyubeum Lim, Yemo Koo, Vijay Krishna Madisetti · 22 Oct 2025

Unified Reinforcement and Imitation Learning for Vision-Language Models
  Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chun Wang, Yueh-Hua Wu · [VLM] · 22 Oct 2025

CARES: Context-Aware Resolution Selector for VLMs
  Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz · [VLM] · 22 Oct 2025

UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models
  Chen Chen, ZeYang Hu, Fengjiao Chen, Liya Ma, Jiaxing Liu, Xiaoyu Li, Xuezhi Cao, Xunliang Cai · 21 Oct 2025

Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
  Zhining Liu, Ziyi Chen, Hui Liu, Chen Luo, Xianfeng Tang, ..., Zhenwei Dai, Zhan Shi, Tianxin Wei, Benoit Dumoulin, Hanghang Tong · [LRM] · 20 Oct 2025
SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
  Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N. Plataniotis, Yao Lu, Song Han, Zhijian Liu · [MLLM, VLM] · 20 Oct 2025

FineVision: Open Data Is All You Need
  Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti · [VLM] · 20 Oct 2025

RL makes MLLMs see better than SFT
  Junha Song, Sangdoo Yun, Dongyoon Han, Jaegul Choo, Byeongho Heo · [OffRL] · 18 Oct 2025

PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
  Lukas Selch, Yufang Hou, Muhammad Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin · 18 Oct 2025

VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
  Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha · [MQ, VLM] · 18 Oct 2025
Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
  Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, Jia-Wang Bian, Mingming Gong · 17 Oct 2025

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
  Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, ..., Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov · 17 Oct 2025

Vision-Centric Activation and Coordination for Multimodal Large Language Models
  Yunnan Wang, Fan Lu, Kecheng Zheng, Ziyuan Huang, Ziqiang Li, Wenjun Zeng, Xin Jin · [MLLM] · 16 Oct 2025

Document Intelligence in the Era of Large Language Models: A Survey
  Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, Daniel Dahlmeier · [AI4TS] · 15 Oct 2025

InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
  Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, ..., Dinghao Zhou, Guiping Zhong, Ken Zheng, Shiyin Kang, Lewei Lu · [MLLM, AuLLM, VGen, VLM] · 15 Oct 2025
ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
  Long Cui, Weiyun Wang, Jie Shao, Zichen Wen, Gen Luo, Linfeng Zhang, Y. Zhang, Yu Qiao, Wenhai Wang · [VLM] · 14 Oct 2025

DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
  Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan · [ObjD, KELM, VLM] · 14 Oct 2025

MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
  Zhenxin Lei, Zhangwei Gao, Changyao Tian, Erfei Cui, Guanzhou Chen, ..., Xiangyu Zhao, Jiayi Ji, Yu Qiao, Wenhai Wang, Gen Luo · [VLM] · 14 Oct 2025

FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models
  Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song, Lianli Gao · [LRM] · 13 Oct 2025

Scaling Language-Centric Omnimodal Representation Learning
  Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong · 13 Oct 2025
Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
  Jinxuan Li, Chaolei Tan, Haoxuan Chen, Jianxin Ma, Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai · [VLM] · 12 Oct 2025

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
  Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia · [LRM] · 12 Oct 2025

Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task
  Zilong Wang, Xiaoyu Shen · 11 Oct 2025

Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping
  Dwip Dalal, Gautam Vashishtha, Utkarsh Mishra, Jeonghwan Kim, Madhav Kanda, Hyeonjeong Ha, Svetlana Lazebnik, Mengyue Yang, Unnat Jain · 10 Oct 2025

CapGeo: A Caption-Assisted Approach to Geometric Reasoning
  Y. Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An, Yongzhen Guo, Wentao Zhang · [ReLM, LRM] · 10 Oct 2025

How to Teach Large Multimodal Models New Skills
  Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem · [MLLM, CLL, KELM] · 09 Oct 2025
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
  Changyao Tian, Hao Li, Gen Luo, Xizhou Zhu, Weijie Su, ..., Y. Liu, Lewei Lu, Wenhai Wang, Hongsheng Li, Jifeng Dai · 09 Oct 2025

Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
  Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Y. Wang, ..., Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu · 08 Oct 2025

LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding
  Zhivar Sourati, Zheng Wang, Marianne Menglin Liu, Yazhe Hu, Mengqing Guo, ..., Kyu Han, Tao Sheng, Sujith Ravi, Morteza Dehghani, Dan Roth · 08 Oct 2025

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
  Omri Uzan, Asaf Yehudai, Roi Pony, Eyal Shnarch, Ariel Gera · 06 Oct 2025

Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting
  Xuyang Guo, Zekai Huang, Zhenmei Shi, Zhao Song, Jiahao Zhang · [CoGe, VLM] · 06 Oct 2025
Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
  Radha Gulhane, Sathish Reddy Indurthi · [OffRL, LRM] · 06 Oct 2025

UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
  Xiangyu Peng, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Chien-Sheng Wu · [VLM] · 04 Oct 2025

Exploring OCR-augmented Generation for Bilingual VQA
  JoonHo Lee, Sunho Park · [VLM] · 02 Oct 2025

ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models
  Krishna Teja Chitty-Venkata, M. Emani · [MLLM, VGen, LRM, VLM] · 02 Oct 2025

Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs
  Sanghwan Kim, Rui Xiao, Stephan Alaniz, Yongqin Xian, Zeynep Akata · 01 Oct 2025

Interpret, prune and distill Donut: towards lightweight VLMs for VQA on document
  Adnan Ben Mansour, Ayoub Karine, D. Naccache · 30 Sep 2025

Defeating Cerberus: Concept-Guided Privacy-Leakage Mitigation in Multimodal Language Models
  Boyang Zhang, Istemi Ekin Akkus, Ruichuan Chen, Alice Dethise, Klaus Satzke, Ivica Rimac, Yang Zhang · [PILM] · 29 Sep 2025
OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding
  Jiancong Xie, Wenjin Wang, Zhuomeng Zhang, Zihan Liu, Qi Liu, Ke Feng, Zixun Sun, Yuedong Yang · [VLM] · 29 Sep 2025

From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
  Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, ..., Guoli Jia, Lingling Li, Z. Lu, Y. Lu, Wenhan Luo · [LRM] · 29 Sep 2025