DocVQA: A Dataset for VQA on Document Images

1 July 2020

Minesh Mathew

Dimosthenis Karatzas

C. V. Jawahar

Papers citing "DocVQA: A Dataset for VQA on Document Images"

50 / 759 papers shown

SCOB: Universal Text Understanding via Character-wise Supervised Contrastive Learning with Online Text Rendering for Bridging Domain Gap

IEEE International Conference on Computer Vision (ICCV), 2023

279

21 Sep 2023

Kosmos-2.5: A Multimodal Literate Model

...

260

20 Sep 2023

PDFTriage: Question Answering over Long, Structured Documents

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

264

16 Sep 2023

Long-Range Transformer Architectures for Document Understanding

177

11 Sep 2023

ImageBind-LLM: Multi-modality Instruction Tuning

...

Yu Qiao

276

152

07 Sep 2023

Understanding Video Scenes through Text: Insights from Text-based Video Question Answering

194

04 Sep 2023

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

IEEE International Conference on Computer Vision (ICCV), 2023

200

03 Sep 2023

Distraction-free Embeddings for Robust VQA

198

31 Aug 2023

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai

Shuai Bai

Shusheng Yang

Shijie Wang

Sinan Tan

Peng Wang

Junyang Lin

Chang Zhou

Jingren Zhou

MLLM VLM ObjD

513

1,565

24 Aug 2023

InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4

Lichao Sun

296

23 Aug 2023

Knowledge Graph Prompting for Multi-Document Question Answering

AAAI Conference on Artificial Intelligence (AAAI), 2023

514

225

22 Aug 2023

DocPrompt: Large-scale continue pretrain for zero-shot and few-shot document question answering

Sijin Wu

Dan Zhang

Teng Hu

Shikun Feng

21 Aug 2023

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

AAAI Conference on Artificial Intelligence (AAAI), 2023

347

188

19 Aug 2023

Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

ACM Multimedia (ACM MM), 2023

Bo Du

147

15 Aug 2023

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

International Conference on Learning Representations (ICLR), 2023

Wei Ji

312

08 Aug 2023

Tiny LVLM-eHub: Early Multimodal Experiments with Bard

IEEE Transactions on Big Data (IEEE Trans. Big Data), 2023

...

Ping Luo

207

07 Aug 2023

RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2023

Venugopal Govindaraju

154

03 Aug 2023

Making the V in Text-VQA Matter

181

01 Aug 2023

HiREN: Towards Higher Supervision Quality for Better Scene Text Image Super-Resolution

261

31 Jul 2023

Workshop on Document Intelligence Understanding

International Conference on Information and Knowledge Management (CIKM), 2023

117

31 Jul 2023

ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning

International Joint Conference on Artificial Intelligence (IJCAI), 2023

Liang Zhao

...

Yuang Peng

Chunrui Han

Xiangyu Zhang

MLLM LRM

169

18 Jul 2023

PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese

International Conference on Multimedia Analysis and Pattern Recognition (ICMAPR), 2023

Nghia Hieu Nguyen

Kiet Van Nguyen

208

17 Jul 2023

Reading Between the Lanes: Text VideoQA on the Road

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2023

273

08 Jul 2023

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Jiabo Ye

...

Ji Zhang

225

155

04 Jul 2023

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Jiuxiang Gu

Diyi Yang

316

283

29 Jun 2023

Natural Language Generation for Advertising: A Survey

Soichiro Murakami

Sho Hoshino

Peinan Zhang

185

22 Jun 2023

On Evaluation of Document Classification using RVL-CDIP

Stefan Larson

Gordon Lim

Kevin Leach

262

21 Jun 2023

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Yu Qiao

Ping Luo

ELM MLLM

309

230

15 Jun 2023

DocumentNet: Bridging the Data Gap in Document Pre-Training

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Alexander G. Hauptmann

H. Dai

Wei Wei

15 Jun 2023

M³IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

Lei Li

Yuwei Yin

Shicheng Li

Liang Chen

Peiyi Wang

...

Yazheng Yang

Jingjing Xu

Xu Sun

Lingpeng Kong

Qi Liu

MLLM VLM

376

136

07 Jun 2023

Do-GOOD: Towards Distribution Shift Evaluation for Pre-Trained Visual Document Understanding Models

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023

162

05 Jun 2023

DocFormerv2: Local Features for Document Understanding

AAAI Conference on Artificial Intelligence (AAAI), 2023

247

02 Jun 2023

Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering

403

01 Jun 2023

PaLI-X: On Scaling up a Multilingual Vision and Language Model

...

Mojtaba Seyedhosseini

334

252

29 May 2023

DAPR: A Benchmark on Document-Aware Passage Retrieval

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Kexin Wang

Nils Reimers

Iryna Gurevych

351

23 May 2023

Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding

105

16 May 2023

On the Hidden Mystery of OCR in Large Multimodal Models

Science China Information Sciences (Sci China Inf Sci), 2023

Yuliang Liu

Lianwen Jin

405

117

13 May 2023

OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

Information Fusion (Inf. Fusion), 2023

194

07 May 2023

Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text Documents via Semantic-Oriented Hierarchical Graphs

International Conference on Language Resources and Evaluation (LREC), 2023

217

03 May 2023

SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2023

Sanket Biswas

Josep Lladós

176

01 May 2023

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

...

Conghui He

Yu Qiao

288

704

28 Apr 2023

Information Redundancy and Biases in Public Document Information Extraction Benchmarks

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2023

S. Laatiri

Pirashanth Ratnamogan

144

28 Apr 2023

MPMQA: Multimodal Question Answering on Product Manuals

AAAI Conference on Artificial Intelligence (AAAI), 2023

Liangfu Zhang

Anwen Hu

Jing Zhang

Shuo Hu

Qin Jin

192

19 Apr 2023

Deep Unrestricted Document Image Rectification

IEEE Transactions on Multimedia (IEEE TMM), 2023

Hao Feng

298

18 Apr 2023

A Question-Answering Approach to Key Value Pair Extraction from Form-like Document Images

AAAI Conference on Artificial Intelligence (AAAI), 2023

199

17 Apr 2023

PDFVQA: A New Dataset for Real-World VQA on PDF Documents

402

13 Apr 2023

Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference

Neural Information Processing Systems (NeurIPS), 2023

Joshua Ainslie

...

220

11 Apr 2023

Efficient OCR for Building a Diverse Digital History

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Jacob Carlson

Tom Bryan

Melissa Dell

237

05 Apr 2023

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Yu Qiao

584

938

28 Mar 2023

TabIQA: Table Questions Answering on Business Document Images

212

27 Mar 2023