DocVQA: A Dataset for VQA on Document Images (arXiv 2007.00398)
1 July 2020
Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

Papers citing "DocVQA: A Dataset for VQA on Document Images"
Showing 50 of 759 citing papers.

See then Tell: Enhancing Key Information Extraction with Vision Grounding
Shuhang Liu, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Qing Wang, Jianshu Zhang, Chenyu Liu
29 Sep 2024

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou, Tianze Luo, Guiyang Xie, Victor Zhang, ..., Guangcong Wang, Juanyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang
Tags: VLM
27 Sep 2024

Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan-Sen Sun, Yufeng Cui, ..., Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang
Tags: MLLM
27 Sep 2024

CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting
Haobo Li, Zhaowei Wang, Jiachen Wang, Yuanbo Wang, Alexis Kai Hon Lau, Huamin Qu
27 Sep 2024

DARE: Diverse Visual Question Answering with Robustness Evaluation
Transactions of the Association for Computational Linguistics (TACL), 2024
Hannah Sterz, Jonas Pfeiffer, Ivan Vulić
Tags: OOD, VLM
26 Sep 2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Computer Vision and Pattern Recognition (CVPR), 2024
Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, ..., Qun Liu, Jun Yao, Lu Hou, Hang Xu
Tags: AuLLM, MLLM, VLM
26 Sep 2024

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2024
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, ..., Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
Tags: OSLM, VLM
25 Sep 2024

A comprehensive study of on-device NLP applications -- VQA, automated Form filling, Smart Replies for Linguistic Codeswitching
Naman Goyal
23 Sep 2024

Phantom of Latent for Large Language and Vision Models
Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro
Tags: VLM, LRM
23 Sep 2024

A-VL: Adaptive Attention for Large Vision-Language Models
AAAI Conference on Artificial Intelligence (AAAI), 2024
Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiangyang Li
Tags: VLM
23 Sep 2024

One Model for Two Tasks: Cooperatively Recognizing and Recovering Low-Resolution Scene Text Images by Iterative Mutual Guidance
Minyi Zhao, Yang Wang, Jihong Guan, Shuigeng Zhou
22 Sep 2024

AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su
Tags: VLM
20 Sep 2024

InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, ..., Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng You
Tags: LRM, AI4CE
19 Sep 2024

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
International Conference on Learning Representations (ICLR), 2024
Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
Tags: ObjD
19 Sep 2024

NVLM: Open Frontier-Class Multimodal LLMs
Wenliang Dai, Nayeon Lee, Wei Ping, Zhuoling Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu
Tags: MLLM, VLM, LRM
17 Sep 2024

Leveraging Distillation Techniques for Document Understanding: A Case Study with FLAN-T5
Jahrestagung der Gesellschaft für Informatik (GI Jahrestagung), 2024
Marcel Lamott, Muhammad Armaghan Shakir
17 Sep 2024

Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Georgios Pantazopoulos, Malvina Nikandrou, Alessandro Suglia, Oliver Lemon, Arash Eshghi
Tags: Mamba
09 Sep 2024

RexUniNLU: Recursive Method with Explicit Schema Instructor for Universal NLU
Chengyuan Liu, Shihang Wang, Fubang Zhao, Kun Kuang, Yangyang Kang, Weiming Lu, Changlong Sun, Fei Wu
09 Sep 2024

POINTS: Improving Your Vision-language Model with Affordable Strategies
Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou
Tags: VLM
07 Sep 2024

WebQuest: A Benchmark for Multimodal QA on Web Page Sequences
Maria Wang, Srinivas Sunkara, Gilles Baechler, Jason Lin, Yun Zhu, Fedir Zubach, Lei Shu, Jindong Chen
Tags: LRM, LLMAG
06 Sep 2024

UNIT: Unifying Image and Text Recognition in One Vision Encoder
Neural Information Processing Systems (NeurIPS), 2024
Yi Zhu, Yanpeng Zhou, Chunwei Wang, Yang Cao, Jianhua Han, Lu Hou, Hang Xu
Tags: ViT, VLM
06 Sep 2024

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
Tags: VLM
05 Sep 2024

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, ..., Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang
Tags: VLM
03 Sep 2024

The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts
I. de Rodrigo, A. Sanchez-Cuadrado, J. Boal, A. J. Lopez-Lopez
Tags: VLM
31 Aug 2024

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li
Tags: VLM
30 Aug 2024

CogVLM2: Visual Language Models for Image and Video Understanding
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, ..., Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang
Tags: VLM, MLLM
29 Aug 2024

μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context
Fabio Quattrini, Carmine Zaccagnino, Silvia Cascianelli, Laura Righi, Rita Cucchiara
28 Aug 2024

GlaLSTM: A Concurrent LSTM Stream Framework for Glaucoma Detection via Biomarker Mining
Cheng Huang, Weizheng Xie, Tsengdar J. Lee, Karanjit S Kooner, Ning Zhang, Yishen Liu
28 Aug 2024

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, ..., Andrew Tao, Zhiding Yu, Guilin Liu
Tags: MLLM
28 Aug 2024

DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
Computer Vision and Pattern Recognition (CVPR), 2024
Wenhui Liao, Jiapeng Wang, Hongliang Li, Chengyu Wang, Jun Huang, Lianwen Jin
27 Aug 2024

IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities
AAAI Conference on Artificial Intelligence (AAAI), 2024
Bin Wang, Chunyu Xie, Dawei Leng, Yuhui Yin
Tags: MLLM
23 Aug 2024

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
International Conference on Learning Representations (ICLR), 2024
Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, ..., Qingsong Wen, Zhang Zhang, Liwen Wang, Rong Jin, Tieniu Tan
Tags: OffRL
23 Aug 2024

Building and better understanding vision-language models: insights and future directions
Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon
Tags: VLM
22 Aug 2024

Large Language Models for Page Stream Segmentation
H. Heidenreich, Ratish Dalvi, Rohith Mukku, Nikhil Verma, Neven Pičuljan
21 Aug 2024

DocTabQA: Answering Questions from Long Documents Using Tables
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2024
Haochen Wang, Kai Hu, Haoyu Dong, Liangcai Gao
Tags: RALM, LMTD
21 Aug 2024

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model
Feipeng Ma, Yizhou Zhou, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
21 Aug 2024

HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S. Nikolopoulos, Hans Vandierendonck, Deepu John, Bo Ji
Tags: MLLM, VLM
20 Aug 2024

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, ..., Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
Tags: VLM
16 Aug 2024

LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
Tags: MLLM, SyDa, VLM
06 Aug 2024

Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models
International Conference on Learning Representations (ICLR), 2024
Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai
04 Aug 2024

Deep Learning based Visually Rich Document Content Understanding: A Survey
Muhammad Ali, Jean Lee, Salman Khan, Eduard Hovy
02 Aug 2024

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
Ruiyi Zhang, Jiuxiang Gu, Jian Chen, Changyou Chen, Tongfei Sun
Tags: VLM
27 Jul 2024

OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, Jingbo Shang
26 Jul 2024

26 Jul 2024
MangaUB: A Manga Understanding Benchmark for Large Multimodal Models
MangaUB: A Manga Understanding Benchmark for Large Multimodal Models
Hikaru Ikuta
Leslie Wöhler
Kiyoharu Aizawa
250
4
0
26 Jul 2024
MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs
MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs
Jihyung Kil
Zheda Mai
Justin Lee
Zihe Wang
Kerrie Cheng
Jingyan Bai
Ye Liu
A. Chowdhury
Wei-Lun Chao
CoGeVLM
387
1
0
23 Jul 2024
Harmonizing Visual Text Comprehension and Generation
Harmonizing Visual Text Comprehension and Generation
Zhen Zhao
Jingqun Tang
Binghong Wu
Chunhui Lin
Shubo Wei
Hao Liu
Xin Tan
Zhizhong Zhang
Can Huang
Yuan Xie
VLM
328
40
0
23 Jul 2024
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal
  Large Language Model
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
Yiwei Ma
Zhibin Wang
Xiaoshuai Sun
Weihuang Lin
Qiang-feng Zhou
Jiayi Ji
Rongrong Ji
MLLMVLM
244
4
0
23 Jul 2024
Token-level Correlation-guided Compression for Efficient Multimodal
  Document Understanding
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
Renshan Zhang
Yibo Lyu
Rui Shao
Gongwei Chen
Weili Guan
Liqiang Nie
242
19
0
19 Jul 2024
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large
  Language Models
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
Leyang Shen
Gongwei Chen
Rui Shao
Weili Guan
Liqiang Nie
MoE
203
34
0
17 Jul 2024
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
Ofir Abramovich
Niv Nayman
Sharon Fogel
I. Lavi
Ron Litman
Shahar Tsiper
Royee Tichauer
Srikar Appalaraju
Shai Mazor
R. Manmatha
VLM
359
6
0
17 Jul 2024
Page 10 of 16