DocVQA: A Dataset for VQA on Document Images

1 July 2020
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
arXiv:2007.00398 (abs · PDF · HTML) · HuggingFace (2 upvotes)

Papers citing "DocVQA: A Dataset for VQA on Document Images"

50 / 759 papers shown
Lumos: Empowering Multimodal LLMs with Scene Text Recognition
Ashish Shenoy
Yichao Lu
Srihari Jayakumar
Debojeet Chatterjee
Mohsen Moslehpour
...
Shicong Zhao
Longfang Zhao
Ankit Ramchandani
Xin Luna Dong
Anuj Kumar
MLLM
214
6
0
12 Feb 2024
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz
Yair Kittenplon
Aviad Aberdam
Elad Ben Avraham
Oren Nuriel
Shai Mazor
Ron Litman
299
36
0
08 Feb 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Chris Liu
Renrui Zhang
Longtian Qiu
Siyuan Huang
Weifeng Lin
...
Hao Shao
Pan Lu
Jiaming Song
Yu Qiao
Shiyang Feng
MLLM
512
139
0
08 Feb 2024
TreeForm: End-to-end Annotation and Evaluation for Form Document Parsing
Ran Zmigrod
Zhiqiang Ma
Armineh Nourbakhsh
Sameena Shah
192
5
0
07 Feb 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
846
96
0
07 Feb 2024
ANLS* -- A Universal Document Processing Metric for Generative Large Language Models
David Peer
Philemon Schöpf
V. Nebendahl
A. Rietzler
Sebastian Stabinger
305
8
0
06 Feb 2024
Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng
Wonjun Kang
Yicong Chen
Hyung Il Koo
Kangwook Lee
MLLM
263
14
0
02 Feb 2024
Instruction Makes a Difference
Tosin Adewumi
Nudrat Habib
Lama Alkhaled
Elisa Barney
VLM, MLLM
286
2
0
01 Feb 2024
LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs
Shaoxiang Chen
Zequn Jie
Lin Ma
MoE
404
85
0
29 Jan 2024
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yue Fan
Jing Gu
KAI-QING Zhou
Qianqi Yan
Shan Jiang
Ching-Chen Kuo
Xinze Guan
Xin Eric Wang
291
11
0
29 Jan 2024
LongFin: A Multimodal Document Understanding Model for Long Financial Domain Documents
Ahmed Masry
Amir Hajian
144
5
0
26 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL, LRM
512
333
0
24 Jan 2024
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
AAAI Conference on Artificial Intelligence (AAAI), 2024
Ryota Tanaka
Taichi Iki
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
VLM
258
36
0
24 Jan 2024
Small Language Model Meets with Reinforced Vision Vocabulary
Haoran Wei
Lingyu Kong
Jinyue Chen
Liang Zhao
Zheng Ge
En Yu
Jian‐Yuan Sun
Chunrui Han
Xiangyu Zhang
VLM
239
47
0
23 Jan 2024
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
International Conference on Machine Learning (ICML), 2024
Xueyu Hu
Ziyu Zhao
Shuang Wei
Ziwei Chai
Qianli Ma
...
Jiwei Li
Kun Kuang
Yang Yang
Hongxia Yang
Leilei Gan
LMTD, ELM
268
93
0
10 Jan 2024
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Yiqi Wang
Wentao Chen
Xiaotian Han
Xudong Lin
Haiteng Zhao
Yongfei Liu
Bohan Zhai
Jianbo Yuan
Quanzeng You
Hongxia Yang
LRM
308
146
0
10 Jan 2024
GRAM: Global Reasoning for Multi-Page VQA
Tsachi Blau
Sharon Fogel
Roi Ronen
Alona Golts
Roy Ganz
Elad Ben Avraham
Aviad Aberdam
Shahar Tsiper
Ron Litman
231
21
0
07 Jan 2024
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Xin He
Longhui Wei
Lingxi Xie
Qi Tian
314
13
0
06 Jan 2024
DocGraphLM: Documental Graph Language Model for Information Extraction
Dongsheng Wang
Zhiqiang Ma
Armineh Nourbakhsh
Kang Gu
Sameena Shah
165
13
0
05 Jan 2024
DocLLM: A layout-aware generative language model for multimodal document understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Dongsheng Wang
Natraj Raman
Mathieu Sibue
Zhiqiang Ma
Petr Babkin
Simerjot Kaur
Yulong Pei
Armineh Nourbakhsh
Xiaomo Liu
VLM
276
106
0
31 Dec 2023
An Empirical Study of Scaling Law for OCR
Miao Rang
Zhenni Bi
Chuanjian Liu
Yunhe Wang
Kai Han
430
12
0
29 Dec 2023
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang
Jingyi Zhang
Kai Jiang
Han Qiu
Shijian Lu
195
30
0
27 Dec 2023
Privacy-Aware Document Visual Question Answering
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2023
Rubèn Pérez Tito
Khanh Nguyen
Marlon Tobaben
Raouf Kerkouche
Mohamed Ali Souibgui
...
Lei Kang
Ernest Valveny
Antti Honkela
Mario Fritz
Dimosthenis Karatzas
219
16
0
15 Dec 2023
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models
European Conference on Computer Vision (ECCV), 2023
Zhiyuan You
Zheyuan Li
Jinjin Gu
Zhenfei Yin
Tianfan Xue
Chao Dong
EGVM
394
90
0
14 Dec 2023
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
Haoran Wei
Lingyu Kong
Jinyue Chen
Liang Zhao
Zheng Ge
Jinrong Yang
Jian‐Yuan Sun
Chunrui Han
Xiangyu Zhang
MLLM, VLM
271
88
0
11 Dec 2023
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
MLLM, VLM
225
22
0
08 Dec 2023
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
Computer Vision and Pattern Recognition (CVPR), 2023
Jialin Wu
Xia Hu
Yaqing Wang
Bo Pang
Radu Soricut
MoE
258
33
0
01 Dec 2023
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model
ACM Multimedia (ACM MM), 2023
Anwen Hu
Yaya Shi
Haiyang Xu
Jiabo Ye
Qinghao Ye
Mingshi Yan
Chenliang Li
Qi Qian
Ji Zhang
Fei Huang
MLLM
255
33
0
30 Nov 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Computer Vision and Pattern Recognition (CVPR), 2023
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM, MLLM
664
857
0
28 Nov 2023
Fully Authentic Visual Question Answering Dataset from Online Communities
European Conference on Computer Vision (ECCV), 2023
Chongyan Chen
Xiyang Dai
Noel Codella
Yunsheng Li
Lu Yuan
Danna Gurari
373
9
0
27 Nov 2023
Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Yunxin Li
Zhenyu Liu
Wei Wang
Xiaochun Cao
Yuxin Ding
Xiaochun Cao
Min Zhang
181
6
0
27 Nov 2023
EIGEN: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images
A. Singh
Venkatapathy Subramanian
Ayush Maheshwari
Pradeep Narayan
D. P. Shetty
Ganesh Ramakrishnan
126
3
0
23 Nov 2023
Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs
Yonghui Wang
Wen-gang Zhou
Hao Feng
Keyi Zhou
Houqiang Li
298
25
0
22 Nov 2023
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
Hao Feng
Qi Liu
Hao Liu
Wen-gang Zhou
Houqiang Li
Can Huang
VLM
344
95
0
20 Nov 2023
Efficient End-to-End Visual Document Understanding with Rationale Distillation
Peng Guo
Alekh Agarwal
Mandar Joshi
Robin Jia
Jesse Thomason
Kristina Toutanova
151
4
0
16 Nov 2023
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Fuxiao Liu
Xiaoyang Wang
Wenlin Yao
Jianshu Chen
Kaiqiang Song
Sangwoo Cho
Yaser Yacoob
Dong Yu
220
163
0
15 Nov 2023
DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models
Peng Tang
Pengkai Zhu
Tian Li
Srikar Appalaraju
Vijay Mahadevan
R. Manmatha
226
9
0
15 Nov 2023
Multiple-Question Multiple-Answer Text-VQA
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Peng Tang
Srikar Appalaraju
R. Manmatha
Yusheng Xie
Vijay Mahadevan
211
7
0
15 Nov 2023
What Large Language Models Bring to Text-rich VQA?
Xuejing Liu
Wei Tang
Xinzhe Ni
Jinghui Lu
Rui Zhao
Zechao Li
Fei Tan
142
11
0
13 Nov 2023
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Computer Vision and Pattern Recognition (CVPR), 2023
Zhang Li
Biao Yang
Qiang Liu
Zhiyin Ma
Shuo Zhang
Jingxu Yang
Yabo Sun
Yuliang Liu
Xiang Bai
MLLM
492
382
0
11 Nov 2023
OtterHD: A High-Resolution Multi-modality Model
Yue Liu
Peiyuan Zhang
Jingkang Yang
Yuanhan Zhang
Fanyi Pu
Ziwei Liu
VLM, MLLM
187
76
0
07 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Information Fusion (Inf. Fusion), 2023
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
399
71
0
01 Nov 2023
Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents
IEEE International Joint Conference on Neural Networks (IJCNN), 2023
Tofik Ali
Partha Pratim Roy
207
0
0
25 Oct 2023
A Multi-Modal Multilingual Benchmark for Document Image Classification
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yoshinari Fujinuma
Siddharth Varia
Nishant Sankaran
Srikar Appalaraju
Bonan Min
Yogarshi Vyas
VLM
240
5
0
25 Oct 2023
Non-Intrusive Adaptation: Input-Centric Parameter-efficient Fine-Tuning for Versatile Multimodal Modeling
Yaqing Wang
Jialin Wu
T. Dabral
Jiageng Zhang
Geoff Brown
...
Frederick Liu
Yi Liang
Bo Pang
Michael Bendersky
Radu Soricut
VLM
182
19
0
18 Oct 2023
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Xi Chen
Xiao Wang
Lucas Beyer
Alexander Kolesnikov
Jialin Wu
...
Keran Rong
Tianli Yu
Daniel Keysers
Xiao-Qi Zhai
Radu Soricut
MLLM, VLM
295
139
0
13 Oct 2023
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Jiabo Ye
Anwen Hu
Haiyang Xu
Qinghao Ye
Mingshi Yan
...
Ji Zhang
Qin Jin
Liang He
Xin Lin
Feiyan Huang
VLM, MLLM
334
125
0
08 Oct 2023
ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks
ACM Multimedia (ACM MM), 2023
Zejun Li
Ye Wang
Mengfei Du
Qingwen Liu
Binhao Wu
...
Zhihao Fan
Jie Fu
Jingjing Chen
Xuanjing Huang
Zhongyu Wei
303
16
0
04 Oct 2023
GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction
ACM Multimedia (ACM MM), 2023
Pengyuan Lyu
Weihong Ma
Hongyi Wang
Yu Yu
Chengquan Zhang
Kun Yao
Yang Xue
Jingdong Wang
LMTD
286
18
0
26 Sep 2023
Analyzing the Efficacy of an LLM-Only Approach for Image-based Document Question Answering
Nidhi Hegde
S. Paul
Gagan Madan
Gaurav Aggarwal
223
9
0
25 Sep 2023