DocVQA: A Dataset for VQA on Document Images

1 July 2020

Minesh Mathew

Dimosthenis Karatzas

C. V. Jawahar

Papers citing "DocVQA: A Dataset for VQA on Document Images"

50 of 759 papers shown

A Survey of Multimodal Large Language Model from A Data-centric Perspective

...

Conghui He

383

26 May 2024

ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Gao Huang

199

24 May 2024

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Chae Won Kim

339

24 May 2024

Unveiling the Tapestry of Consistency in Large Vision-Language Models

Neural Information Processing Systems (NeurIPS), 2024

Yuan Zhang

336

23 May 2024

Imp: Highly Capable Large Multimodal Models for Mobile Devices

263

20 May 2024

Rethinking Overlooked Aspects in Vision-Language Models

Yuan Liu

Le Tian

Xiao Zhou

Jie Zhou

VLM

230

20 May 2024

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

...

788

20 May 2024

Efficient Multimodal Large Language Models: A Survey

Yizhang Jin

Jian Li

Yexin Liu

Tianjun Gu

Kai Wu

...

Xin Tan

Zhenye Gan

Yabiao Wang

Chengjie Wang

Lizhuang Ma

LRM

307

17 May 2024

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

Zeyu Lu

Ying Shan

Ping Luo

MLLM VLM

145

13 May 2024

Federated Document Visual Question Answering: A Pilot Study

Khanh Nguyen

Dimosthenis Karatzas

FedML

316

10 May 2024

Exploring the Capabilities of Large Multimodal Models on Dense Text

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2024

Yuliang Liu

201

09 May 2024

Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents

192

08 May 2024

GeoContrastNet: Contrastive Key-Value Edge Learning for Language-Agnostic Document Understanding

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2024

Nil Biescas

Carlos Boned Riera

Josep Lladós

Sanket Biswas

202

06 May 2024

What matters when building vision-language models?

Neural Information Processing Systems (NeurIPS), 2024

302

276

03 May 2024

MANTIS: Interleaved Multi-Image Instruction Tuning

416

183

02 May 2024

KVP10k : A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents

...

187

01 May 2024

CREPE: Coordinate-Aware End-to-End Document Parser

256

01 May 2024

Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism

251

29 Apr 2024

MileBench: Benchmarking MLLMs in Long Context

Xiang Wan

369

29 Apr 2024

ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images

Huy Quang Pham

Thang Kien-Bao Nguyen

225

29 Apr 2024

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

...

Dahua Lin

Yu Qiao

Jifeng Dai

Wenhai Wang

MLLM VLM

528

983

25 Apr 2024

An empirical study of LLaMA3 quantization: from LLMs to MLLMs

Xianglong Liu

Michele Magno

549

22 Apr 2024

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

300

19 Apr 2024

PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering

214

19 Apr 2024

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

...

467

19 Apr 2024

ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

Quan Van Nguyen

Dan Quang Tran

Huy Quang Pham

Thang Kien-Bao Nguyen

626

16 Apr 2024

TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding

244

15 Apr 2024

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

213

14 Apr 2024

HRVDA: High-Resolution Visual Document Assistant

Computer Vision and Pattern Recognition (CVPR), 2024

Xin Li

277

10 Apr 2024

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Neural Information Processing Systems (NeurIPS), 2024

...

Dahua Lin

276

159

09 Apr 2024

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

376

08 Apr 2024

BuDDIE: A Business Document Dataset for Multi-task Information Extraction

...

217

05 Apr 2024

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Shanghang Zhang

402

29 Mar 2024

JDocQA: Japanese Document Question Answering Dataset for Generative Language Models

285

28 Mar 2024

Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models

Xiang An

201

28 Mar 2024

Can AI Models Appreciate Document Aesthetics? An Exploration of Legibility and Layout Quality in Relation to Prediction Confidence

262

27 Mar 2024

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Han Xiao

346

216

25 Mar 2024

Visually Guided Generative Text-Layout Pre-training for Document Intelligence

Xin Jiang

Qun Liu

Kam-Fai Wong

220

25 Mar 2024

LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding

Masato Fujitake

MLLM

175

21 Mar 2024

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Jiabo Ye

...

Ji Zhang

Qin Jin

Fei Huang

Jingren Zhou

VLM

305

196

19 Mar 2024

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

IEEE Transactions on Knowledge and Data Engineering (TKDE), 2024

471

18 Mar 2024

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

...

521

246

14 Mar 2024

TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

Yuliang Liu

313

150

07 Mar 2024

Transformers and Language Models in Form Understanding: A Comprehensive Review of Scanned Document Analysis

188

06 Mar 2024

Model-Based Data-Centric AI: Bridging the Divide Between Academic Ideals and Industrial Pragmatism

Chanjun Park

Minsoo Khang

Dahyun Kim

164

04 Mar 2024

InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding

Hongxia Yang

149

03 Mar 2024

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

Xin Li

208

29 Feb 2024

Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding

Hongshen Xu

Kai Yu

155

28 Feb 2024

Improving Language Understanding from Screenshots

201

21 Feb 2024

RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning

...

184

19 Feb 2024