2007.00398
DocVQA: A Dataset for VQA on Document Images
1 July 2020
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
Papers citing
"DocVQA: A Dataset for VQA on Document Images"
50 / 759 papers shown
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
383
64
0
26 May 2024
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Chunjiang Ge
Sijie Cheng
Xiangqi Jin
Jiale Yuan
Yuan Gao
Jun Song
Shiji Song
Gao Huang
Bo Zheng
MLLM
VLM
199
23
0
24 May 2024
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Byung-Kwan Lee
Chae Won Kim
Beomchan Park
Yonghyun Ro
MLLM
LRM
339
28
0
24 May 2024
Unveiling the Tapestry of Consistency in Large Vision-Language Models
Neural Information Processing Systems (NeurIPS), 2024
Yuan Zhang
Fei Xiao
Tao Huang
Chun-Kai Fan
Hongyuan Dong
Jiawen Li
Jiacong Wang
Kuan Cheng
Shanghang Zhang
Haoyuan Guo
336
21
0
23 May 2024
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Zhenwei Shao
Zhou Yu
Jun Yu
Xuecheng Ouyang
Lihao Zheng
Zhenbiao Gai
Mingyang Wang
Jiajun Ding
263
23
0
20 May 2024
Rethinking Overlooked Aspects in Vision-Language Models
Yuan Liu
Le Tian
Xiao Zhou
Jie Zhou
VLM
230
2
0
20 May 2024
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
Jingqun Tang
Qi-dong Liu
Yongjie Ye
Jinghui Lu
Shubo Wei
...
Hao Liu
Xiang Bai
Can Huang
788
50
0
20 May 2024
Efficient Multimodal Large Language Models: A Survey
Yizhang Jin
Jian Li
Yexin Liu
Tianjun Gu
Kai Wu
...
Xin Tan
Zhenye Gan
Yabiao Wang
Chengjie Wang
Lizhuang Ma
LRM
307
86
0
17 May 2024
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
Chengyue Wu
Yixiao Ge
Qiushan Guo
Jiahao Wang
Zhixuan Liang
Zeyu Lu
Ying Shan
Ping Luo
MLLM
VLM
145
3
0
13 May 2024
Federated Document Visual Question Answering: A Pilot Study
Khanh Nguyen
Dimosthenis Karatzas
FedML
316
0
0
10 May 2024
Exploring the Capabilities of Large Multimodal Models on Dense Text
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2024
Shuo Zhang
Biao Yang
Zhang Li
Zhiyin Ma
Yuliang Liu
Xiang Bai
VLM
201
12
0
09 May 2024
Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents
Yanfei Dong
Lambert Deng
Jiazheng Zhang
Xiaodong Yu
Ting Lin
Francesco Gelli
Soujanya Poria
W. Lee
192
0
0
08 May 2024
GeoContrastNet: Contrastive Key-Value Edge Learning for Language-Agnostic Document Understanding
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2024
Nil Biescas
Carlos Boned Riera
Josep Lladós
Sanket Biswas
202
4
0
06 May 2024
What matters when building vision-language models?
Neural Information Processing Systems (NeurIPS), 2024
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
302
276
0
03 May 2024
MANTIS: Interleaved Multi-Image Instruction Tuning
Dongfu Jiang
Xuan He
Huaye Zeng
Cong Wei
Max Ku
Qian Liu
Wenhu Chen
VLM
MLLM
416
183
0
02 May 2024
KVP10k : A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents
O. Naparstek
Roi Pony
Inbar Shapira
Foad Abo Dahood
Ophir Azulai
...
Idan Friedman
Orit Prince
Yevgeny Burshtein
Adi Raz Goldfarb
Udi Barzelay
187
3
0
01 May 2024
CREPE: Coordinate-Aware End-to-End Document Parser
Yamato Okamoto
Youngmin Baek
Geewook Kim
Ryota Nakao
Donghyun Kim
Moonbin Yim
Seunghyun Park
Bado Lee
256
2
0
01 May 2024
Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism
Lei Kang
Rubèn Pérez Tito
Ernest Valveny
Dimosthenis Karatzas
251
10
0
29 Apr 2024
MileBench: Benchmarking MLLMs in Long Context
Dingjie Song
Shunian Chen
Guiming Hardy Chen
Fei Yu
Xiang Wan
Benyou Wang
VLM
369
62
0
29 Apr 2024
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images
Huy Quang Pham
Thang Kien-Bao Nguyen
Quan Van Nguyen
Dan Quang Tran
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
225
10
0
29 Apr 2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
...
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM
VLM
528
983
0
25 Apr 2024
An empirical study of LLaMA3 quantization: from LLMs to MLLMs
Wei Huang
Xingyu Zheng
Xudong Ma
Haotong Qin
Chengtao Lv
Hong Chen
Jie Luo
Xiaojuan Qi
Xianglong Liu
Michele Magno
MQ
549
66
0
22 Apr 2024
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Zhuofan Zong
Bingqi Ma
Dazhong Shen
Guanglu Song
Hao Shao
Dongzhi Jiang
Jiaming Song
Yu Liu
MoE
300
86
0
19 Apr 2024
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Yihao Ding
Kaixuan Ren
Jiabin Huang
Siwen Luo
S. Han
214
4
0
19 Apr 2024
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Jingqun Tang
Chunhui Lin
Zhen Zhao
Shubo Wei
Binghong Wu
...
Yuliang Liu
Xiang Bai
Can Huang
LRM
VLM
MLLM
467
45
0
19 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
CoGe
626
8
0
16 Apr 2024
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding
Bozhi Luan
Hao Feng
Hong Chen
Yonghui Wang
Wen-gang Zhou
Houqiang Li
MLLM
244
26
0
15 Apr 2024
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
Ya-Qi Yu
Minghui Liao
Jihao Wu
Yongxin Liao
Xiaoyu Zheng
Wei Zeng
VLM
213
21
0
14 Apr 2024
HRVDA: High-Resolution Visual Document Assistant
Computer Vision and Pattern Recognition (CVPR), 2024
Chaohu Liu
Kun Yin
Haoyu Cao
Xinghua Jiang
Xin Li
Yinsong Liu
Deqiang Jiang
Xing Sun
Linli Xu
VLM
277
32
0
10 Apr 2024
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Neural Information Processing Systems (NeurIPS), 2024
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Sijin Yu
...
Xingcheng Zhang
Jifeng Dai
Yuxin Qiao
Dahua Lin
Yuan Liu
VLM
MLLM
276
159
0
09 Apr 2024
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Chuwei Luo
Yufan Shen
Zhaoqing Zhu
Qi Zheng
Zhi Yu
Cong Yao
376
98
0
08 Apr 2024
BuDDIE: A Business Document Dataset for Multi-task Information Extraction
Ran Zmigrod
Dongsheng Wang
Mathieu Sibue
Yulong Pei
Petr Babkin
...
Antony Papadimitriou
William Watson
Zhiqiang Ma
Armineh Nourbakhsh
Sameena Shah
217
7
0
05 Apr 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin
Xinyu Wei
Ruichuan An
Shiyang Feng
Bocheng Zou
Yulin Luo
Siyuan Huang
Shanghang Zhang
Jiaming Song
VLM
402
84
0
29 Mar 2024
JDocQA: Japanese Document Question Answering Dataset for Generative Language Models
Eri Onami
Shuhei Kurita
Taiki Miyanishi
Taro Watanabe
285
9
0
28 Mar 2024
Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
Jiaxing Chen
Yuxuan Liu
Dehu Li
Xiang An
Weimo Deng
Ziyong Feng
Yongle Zhao
Yin Xie
LRM
201
19
0
28 Mar 2024
Can AI Models Appreciate Document Aesthetics? An Exploration of Legibility and Layout Quality in Relation to Prediction Confidence
Hsiu-Wei Yang
Abhinav Agrawal
Pavlos Fragkogiannis
Shubham Nitin Mulay
262
3
0
27 Mar 2024
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Hao Shao
Shengju Qian
Han Xiao
Guanglu Song
Zhuofan Zong
Letian Wang
Yu Liu
Jiaming Song
VGen
LRM
MLLM
346
216
0
25 Mar 2024
Visually Guided Generative Text-Layout Pre-training for Document Intelligence
Zhiming Mao
Haoli Bai
Lu Hou
Jiansheng Wei
Xin Jiang
Qun Liu
Kam-Fai Wong
220
12
0
25 Mar 2024
LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding
Masato Fujitake
MLLM
175
18
0
21 Mar 2024
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
Haiyang Xu
Jiabo Ye
Mingshi Yan
Liang Zhang
...
Chen Li
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
305
196
0
19 Mar 2024
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2024
Kung-Hsiang Huang
Hou Pong Chan
Yi R. Fung
Haoyi Qiu
Mingyang Zhou
Shafiq Joty
Shih-Fu Chang
Chenhui Xu
AI4TS
471
56
0
18 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
521
246
0
14 Mar 2024
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Yuliang Liu
Biao Yang
Qiang Liu
Zhang Li
Zhiyin Ma
Shuo Zhang
Xiang Bai
MLLM
VLM
313
150
0
07 Mar 2024
Transformers and Language Models in Form Understanding: A Comprehensive Review of Scanned Document Analysis
Abdelrahman Abdallah
Daniel Eberharter
Zoe Pfister
Adam Jatowt
188
15
0
06 Mar 2024
Model-Based Data-Centric AI: Bridging the Divide Between Academic Ideals and Industrial Pragmatism
Chanjun Park
Minsoo Khang
Dahyun Kim
164
2
0
04 Mar 2024
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
Haogeng Liu
Quanzeng You
Xiaotian Han
Yiqi Wang
Bohan Zhai
Yongfei Liu
Yunzhe Tao
Huaibo Huang
Ran He
Hongxia Yang
MLLM
149
11
0
03 Mar 2024
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models
Xin Li
Yunfei Wu
Xinghua Jiang
Zhihao Guo
Ming Gong
Haoyu Cao
Yinsong Liu
Deqiang Jiang
Xing Sun
VLM
208
36
0
29 Feb 2024
Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding
Hongshen Xu
Lu Chen
Zihan Zhao
Da Ma
Ruisheng Cao
Zichen Zhu
Kai Yu
155
5
0
28 Feb 2024
Improving Language Understanding from Screenshots
Tianyu Gao
Zirui Wang
Adithya Bhaskar
Danqi Chen
VLM
201
13
0
21 Feb 2024
RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning
Congyun Jin
Ming Zhang
Xiaowei Ma
Yujiao Li
Yingbo Wang
...
Chenfei Chi
Xiangguo Lv
Fangzhou Li
Wei Xue
Yiran Huang
LM&MA
184
10
0
19 Feb 2024