DocVQA: A Dataset for VQA on Document Images

1 July 2020
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
arXiv:2007.00398 (abs · PDF · HTML) · HuggingFace (2 upvotes)

Papers citing "DocVQA: A Dataset for VQA on Document Images"

50 / 759 papers shown
Lumos: Empowering Multimodal LLMs with Scene Text Recognition
Ashish Shenoy
Yichao Lu
Srihari Jayakumar
Debojeet Chatterjee
Mohsen Moslehpour
...
Shicong Zhao
Longfang Zhao
Ankit Ramchandani
Xin Luna Dong
Anuj Kumar
MLLM
214
6
0
12 Feb 2024
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz
Yair Kittenplon
Aviad Aberdam
Elad Ben Avraham
Oren Nuriel
Shai Mazor
Ron Litman
299
36
0
08 Feb 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Chris Liu
Renrui Zhang
Longtian Qiu
Siyuan Huang
Weifeng Lin
...
Hao Shao
Pan Lu
Jiaming Song
Yu Qiao
Shiyang Feng
MLLM
512
139
0
08 Feb 2024
TreeForm: End-to-end Annotation and Evaluation for Form Document Parsing
Ran Zmigrod
Zhiqiang Ma
Armineh Nourbakhsh
Sameena Shah
192
5
0
07 Feb 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
846
96
0
07 Feb 2024
ANLS* -- A Universal Document Processing Metric for Generative Large Language Models
David Peer
Philemon Schöpf
V. Nebendahl
A. Rietzler
Sebastian Stabinger
305
8
0
06 Feb 2024
Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng
Wonjun Kang
Yicong Chen
Hyung Il Koo
Kangwook Lee
MLLM
263
14
0
02 Feb 2024
Instruction Makes a Difference
Tosin Adewumi
Nudrat Habib
Lama Alkhaled
Elisa Barney
VLM, MLLM
286
2
0
01 Feb 2024
LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs
Shaoxiang Chen
Zequn Jie
Lin Ma
MoE
404
85
0
29 Jan 2024
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yue Fan
Jing Gu
KAI-QING Zhou
Qianqi Yan
Shan Jiang
Ching-Chen Kuo
Xinze Guan
Xin Eric Wang
291
11
0
29 Jan 2024
LongFin: A Multimodal Document Understanding Model for Long Financial Domain Documents
Ahmed Masry
Amir Hajian
144
5
0
26 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL, LRM
512
333
0
24 Jan 2024
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
AAAI Conference on Artificial Intelligence (AAAI), 2024
Ryota Tanaka
Taichi Iki
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
VLM
258
36
0
24 Jan 2024
Small Language Model Meets with Reinforced Vision Vocabulary
Haoran Wei
Lingyu Kong
Jinyue Chen
Liang Zhao
Zheng Ge
En Yu
Jian‐Yuan Sun
Chunrui Han
Xiangyu Zhang
VLM
239
47
0
23 Jan 2024
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
International Conference on Machine Learning (ICML), 2024
Xueyu Hu
Ziyu Zhao
Shuang Wei
Ziwei Chai
Qianli Ma
...
Jiwei Li
Kun Kuang
Yang Yang
Hongxia Yang
Leilei Gan
LMTD, ELM
268
93
0
10 Jan 2024
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Yiqi Wang
Wentao Chen
Xiaotian Han
Xudong Lin
Haiteng Zhao
Yongfei Liu
Bohan Zhai
Jianbo Yuan
Quanzeng You
Hongxia Yang
LRM
308
146
0
10 Jan 2024
GRAM: Global Reasoning for Multi-Page VQA
Tsachi Blau
Sharon Fogel
Roi Ronen
Alona Golts
Roy Ganz
Elad Ben Avraham
Aviad Aberdam
Shahar Tsiper
Ron Litman
231
21
0
07 Jan 2024
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Xin He
Longhui Wei
Lingxi Xie
Qi Tian
314
13
0
06 Jan 2024
DocGraphLM: Documental Graph Language Model for Information Extraction
Dongsheng Wang
Zhiqiang Ma
Armineh Nourbakhsh
Kang Gu
Sameena Shah
165
13
0
05 Jan 2024
DocLLM: A layout-aware generative language model for multimodal document understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Dongsheng Wang
Natraj Raman
Mathieu Sibue
Zhiqiang Ma
Petr Babkin
Simerjot Kaur
Yulong Pei
Armineh Nourbakhsh
Xiaomo Liu
VLM
276
106
0
31 Dec 2023
An Empirical Study of Scaling Law for OCR
Miao Rang
Zhenni Bi
Chuanjian Liu
Yunhe Wang
Kai Han
430
12
0
29 Dec 2023
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang
Jingyi Zhang
Kai Jiang
Han Qiu
Shijian Lu
195
30
0
27 Dec 2023
Privacy-Aware Document Visual Question Answering
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2023
Rubèn Pérez Tito
Khanh Nguyen
Marlon Tobaben
Raouf Kerkouche
Mohamed Ali Souibgui
...
Lei Kang
Ernest Valveny
Antti Honkela
Mario Fritz
Dimosthenis Karatzas
219
16
0
15 Dec 2023
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models
European Conference on Computer Vision (ECCV), 2023
Zhiyuan You
Zheyuan Li
Jinjin Gu
Zhenfei Yin
Tianfan Xue
Chao Dong
EGVM
394
90
0
14 Dec 2023
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
Haoran Wei
Lingyu Kong
Jinyue Chen
Liang Zhao
Zheng Ge
Jinrong Yang
Jian‐Yuan Sun
Chunrui Han
Xiangyu Zhang
MLLM, VLM
271
88
0
11 Dec 2023
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
MLLM, VLM
225
22
0
08 Dec 2023
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
Computer Vision and Pattern Recognition (CVPR), 2023
Jialin Wu
Xia Hu
Yaqing Wang
Bo Pang
Radu Soricut
MoE
258
33
0
01 Dec 2023
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model
ACM Multimedia (ACM MM), 2023
Anwen Hu
Yaya Shi
Haiyang Xu
Jiabo Ye
Qinghao Ye
Mingshi Yan
Chenliang Li
Qi Qian
Ji Zhang
Fei Huang
MLLM
255
33
0
30 Nov 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Computer Vision and Pattern Recognition (CVPR), 2023
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM, MLLM
664
857
0
28 Nov 2023
Fully Authentic Visual Question Answering Dataset from Online Communities
European Conference on Computer Vision (ECCV), 2023
Chongyan Chen
Xiyang Dai
Noel Codella
Yunsheng Li
Lu Yuan
Danna Gurari
373
9
0
27 Nov 2023
Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Yunxin Li
Zhenyu Liu
Wei Wang
Xiaochun Cao
Yuxin Ding
Xiaochun Cao
Min Zhang
181
6
0
27 Nov 2023
EIGEN: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images
A. Singh
Venkatapathy Subramanian
Ayush Maheshwari
Pradeep Narayan
D. P. Shetty
Ganesh Ramakrishnan
126
3
0
23 Nov 2023
Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs
Yonghui Wang
Wen-gang Zhou
Hao Feng
Keyi Zhou
Houqiang Li
298
25
0
22 Nov 2023
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
Hao Feng
Qi Liu
Hao Liu
Wen-gang Zhou
Houqiang Li
Can Huang
VLM
344
95
0
20 Nov 2023
Efficient End-to-End Visual Document Understanding with Rationale Distillation
Peng Guo
Alekh Agarwal
Mandar Joshi
Robin Jia
Jesse Thomason
Kristina Toutanova
151
4
0
16 Nov 2023
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Fuxiao Liu
Xiaoyang Wang
Wenlin Yao
Jianshu Chen
Kaiqiang Song
Sangwoo Cho
Yaser Yacoob
Dong Yu
220
163
0
15 Nov 2023
DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models
Peng Tang
Pengkai Zhu
Tian Li
Srikar Appalaraju
Vijay Mahadevan
R. Manmatha
226
9
0
15 Nov 2023
Multiple-Question Multiple-Answer Text-VQA
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Peng Tang
Srikar Appalaraju
R. Manmatha
Yusheng Xie
Vijay Mahadevan
211
7
0
15 Nov 2023
What Large Language Models Bring to Text-rich VQA?
Xuejing Liu
Wei Tang
Xinzhe Ni
Jinghui Lu
Rui Zhao
Zechao Li
Fei Tan
142
11
0
13 Nov 2023
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Computer Vision and Pattern Recognition (CVPR), 2023
Zhang Li
Biao Yang
Qiang Liu
Zhiyin Ma
Shuo Zhang
Jingxu Yang
Yabo Sun
Yuliang Liu
Xiang Bai
MLLM
492
382
0
11 Nov 2023
OtterHD: A High-Resolution Multi-modality Model
Yue Liu
Peiyuan Zhang
Jingkang Yang
Yuanhan Zhang
Fanyi Pu
Ziwei Liu
VLM, MLLM
187
76
0
07 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Information Fusion (Inf. Fusion), 2023
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
399
71
0
01 Nov 2023
Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents
IEEE International Joint Conference on Neural Networks (IJCNN), 2023
Tofik Ali
Partha Pratim Roy
207
0
0
25 Oct 2023
A Multi-Modal Multilingual Benchmark for Document Image Classification
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yoshinari Fujinuma
Siddharth Varia
Nishant Sankaran
Srikar Appalaraju
Bonan Min
Yogarshi Vyas
VLM
240
5
0
25 Oct 2023
Non-Intrusive Adaptation: Input-Centric Parameter-efficient Fine-Tuning for Versatile Multimodal Modeling
Yaqing Wang
Jialin Wu
T. Dabral
Jiageng Zhang
Geoff Brown
...
Frederick Liu
Yi Liang
Bo Pang
Michael Bendersky
Radu Soricut
VLM
182
19
0
18 Oct 2023
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Xi Chen
Xiao Wang
Lucas Beyer
Alexander Kolesnikov
Jialin Wu
...
Keran Rong
Tianli Yu
Daniel Keysers
Xiao-Qi Zhai
Radu Soricut
MLLM, VLM
295
139
0
13 Oct 2023
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Jiabo Ye
Anwen Hu
Haiyang Xu
Qinghao Ye
Mingshi Yan
...
Ji Zhang
Qin Jin
Liang He
Xin Lin
Feiyan Huang
VLM, MLLM
334
125
0
08 Oct 2023
ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks
ACM Multimedia (ACM MM), 2023
Zejun Li
Ye Wang
Mengfei Du
Qingwen Liu
Binhao Wu
...
Zhihao Fan
Jie Fu
Jingjing Chen
Xuanjing Huang
Zhongyu Wei
303
16
0
04 Oct 2023
GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction
ACM Multimedia (ACM MM), 2023
Pengyuan Lyu
Weihong Ma
Hongyi Wang
Yu Yu
Chengquan Zhang
Kun Yao
Yang Xue
Jingdong Wang
LMTD
286
18
0
26 Sep 2023
Analyzing the Efficacy of an LLM-Only Approach for Image-based Document Question Answering
Nidhi Hegde
S. Paul
Gagan Madan
Gaurav Aggarwal
223
9
0
25 Sep 2023