DocVQA: A Dataset for VQA on Document Images

1 July 2020

Dimosthenis Karatzas

Papers citing "DocVQA: A Dataset for VQA on Document Images"

50 / 759 papers shown

Interpret, prune and distill Donut : towards lightweight VLMs for VQA on document

Adnan Ben Mansour

138

0

0

30 Sep 2025

Defeating Cerberus: Concept-Guided Privacy-Leakage Mitigation in Multimodal Language Models

Istemi Ekin Akkus

195

0

0

29 Sep 2025

From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

...

455

10

0

29 Sep 2025

OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding

90

0

0

29 Sep 2025

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

...

Jiankang Deng

372

46

0

28 Sep 2025

Visual CoT Makes VLMs Smarter but More Fragile

150

0

0

28 Sep 2025

RIV: Recursive Introspection Mask Diffusion Vision Language Model

85

1

0

28 Sep 2025

LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Shubhang Bhatnagar

215

0

0

28 Sep 2025

Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

127

0

0

27 Sep 2025

SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction

Soyeon Caren Han

105

0

0

27 Sep 2025

Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding

Mubashara Akhtar

Mrinmaya Sachan

143

0

0

26 Sep 2025

Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models

Andreas Fischer

87

0

0

26 Sep 2025

MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

...

130

11

0

25 Sep 2025

DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

102

0

0

25 Sep 2025

TABLET: A Large-Scale Dataset for Robust Visual Table Understanding

395

0

0

25 Sep 2025

CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

Sina J. Semnani

Merve Tekgürler

137

0

0

24 Sep 2025

A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA

Yova Kementchedjhieva

50

0

0

24 Sep 2025

Rule Encoding and Compliance in Large Language Models: An Information-Theoretic Analysis

Joachim Diederich

204

0

0

23 Sep 2025

Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards

191

1

0

23 Sep 2025

Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception

269

0

0

21 Sep 2025

Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

...

99

1

0

19 Sep 2025

Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models

105

2

0

19 Sep 2025

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

...

Zhengdong Zhang

205

4

0

19 Sep 2025

Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration

111

1

0

17 Sep 2025

SAIL-VL2 Technical Report

...

301

4

0

17 Sep 2025

AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models

...

354

0

0

16 Sep 2025

3D Aware Region Prompted Vision Language Model

...

Pavlo Molchanov

139

8

0

16 Sep 2025

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

...

209

25

0

16 Sep 2025

MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs

333

1

0

15 Sep 2025

PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models

142

0

0

14 Sep 2025

Towards Reliable and Interpretable Document Question Answering via VLMs

Simone Giovannini

193

0

0

12 Sep 2025

VARCO-VISION-2.0 Technical Report

219

2

0

12 Sep 2025

Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning

...

95

4

0

10 Sep 2025

Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Boammani Aser Lompo

LMTD ReLM VLM LRM

132

1

0

09 Sep 2025

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

98

0

0

08 Sep 2025

MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

139

4

0

06 Sep 2025

OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

366

13

0

03 Sep 2025

VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality

Srihari Bandraupalli

72

1

0

03 Sep 2025

A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation

Daulet Toibazar

Pedro J. Moreno

82

0

0

02 Sep 2025

MoPEQ: Mixture of Mixed Precision Quantized Experts

Krishna Teja Chitty-Venkata

98

2

0

02 Sep 2025

CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models

Rodrigo Ventura

107

1

0

29 Aug 2025

SUMMA: A Multimodal Large Language Model for Advertisement Summarization

135

0

0

28 Aug 2025

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

208

5

0

28 Aug 2025

KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts

161

1

0

27 Aug 2025

Extracting Information from Scientific Literature via Visual Table Question Answering Models

94

0

0

26 Aug 2025

Enhancing Document VQA Models via Retrieval-Augmented Generation

Artemis LLabres

222

1

0

26 Aug 2025

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

...

305

298

0

25 Aug 2025

Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs

Abhirama Subramanyam Penamakuri

Abhishek Bhandari

267

2

0

24 Aug 2025

MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models

Krishna Teja Chitty-Venkata

Natalia Vassilieva

Siddhisanket Raskar

121

1

0

24 Aug 2025

Explain Before You Answer: A Survey on Compositional Visual Reasoning

...

Gholamreza Haffari

364

10

0

24 Aug 2025
