MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

20 May 2024

Mohamad Fitri Faiz Bin Mahmood

Papers citing "MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering"

50 / 82 papers shown

Jina-VLM: Small Multilingual Vision Language Model

336

03 Dec 2025

LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

121

25 Nov 2025

DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation

128

23 Nov 2025

VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging

224

22 Nov 2025

Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025

Prithviraj Ammanabrolu

343

07 Nov 2025

Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

...

19 Sep 2025

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

...

294

249

25 Aug 2025

A Metric for MLLM Alignment in Large-scale Recommendation

121

07 Aug 2025

HW-MLVQA: Elucidating Multilingual Handwritten Document Understanding with a Comprehensive VQA Benchmark

21 Jul 2025

The Multilingual Divide and Its Impact on Global AI Safety

Aidan Peppin

Julia Kreutzer

Alice Schoenauer Sebag

...

304

27 May 2025

Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts

IEEE Pacific Visualization Symposium (PacificVis), 2025

271

23 May 2025

Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning

...

393

21 May 2025

VoQA: Visual-only Question Answering

324

20 May 2025

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

324

20 May 2025

Advancing Sequential Numerical Prediction in Autoregressive Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

555

19 May 2025

Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?

371

19 May 2025

LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?

570

18 May 2025

WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?

...

411

16 May 2025

PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

Ijazul Haq

Yingjie Zhang

Irfan Ali Khan

320

15 May 2025

Seed1.5-VL Technical Report

...

211

165

11 May 2025

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks

...

548

26 Apr 2025

Benchmarking Vision Language Models on German Factual Data

Artificial Intelligence Applications and Innovations (AIAI), 2025

René Peinl

Vincent Tischler

CoGe

339

15 Apr 2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

...

555

790

14 Apr 2025

XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Computer Vision and Pattern Recognition (CVPR), 2025

...

324

31 Mar 2025

TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation

1.2K

22 Mar 2025

PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks

453

06 Mar 2025

Task-Oriented 6-DoF Grasp Pose Detection in Clutters

IEEE International Conference on Robotics and Automation (ICRA), 2025

329

24 Feb 2025

Cross-Modal Synergies: Unveiling the Potential of Motion-Aware Fusion Networks in Handling Dynamic and Static ReID Scenarios

432

02 Feb 2025

Detection of AI Deepfake and Fraud in Online Payments Using GAN-Based Models

310

13 Jan 2025

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

...

355

31 Dec 2024

Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance

AAAI Conference on Artificial Intelligence (AAAI), 2024

779

17 Dec 2024

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark

241

15 Oct 2024

SELU: Self-Learning Embodied MLLMs in Unknown Environments

Boyu Li

Haoran Li

Zongqing Lu

188

04 Oct 2024

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

Lin Li

Guikun Chen

Hanrong Shi

Jun Xiao

Long Chen

343

21 Sep 2024

A Survey on Evaluation of Multimodal Large Language Models

Jiaxing Huang

Jingyi Zhang

LM&MA ELM LRM

303

28 Aug 2024

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

International Conference on Learning Representations (ICLR), 2024

Yi-Fan Zhang

Huanyu Zhang

Haochen Tian

Chaoyou Fu

Shuangqing Zhang

...

Qingsong Wen

Zhang Zhang

Liwen Wang

Rong Jin

Tieniu Tan

OffRL

363

134

23 Aug 2024

ParGo: Bridging Vision-Language with Partial and Global Views

AAAI Conference on Artificial Intelligence (AAAI), 2024

514

23 Aug 2024

Contextual Bandits for Unbounded Context Distributions

540

19 Aug 2024

Harmonizing Visual Text Comprehension and Generation

Yuan Xie

320

23 Jul 2024

IMAGDressing-v1: Customizable Virtual Dressing

Zechao Li

273

100

17 Jul 2024

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

...

724

356

16 Jul 2024

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

...

623

02 Jul 2024

DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

371

27 Jun 2024

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

Neural Information Processing Systems (NeurIPS), 2024

David Romero

Chenyang Lyu

Haryo Akbarianto Wibowo

...

321

10 Jun 2024

TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy

Qi Liu

...

276

03 Jun 2024

ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images

Huy Quang Pham

Thang Kien-Bao Nguyen

217

29 Apr 2024

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

...

Dahua Lin

Yu Qiao

Jifeng Dai

Wenhai Wang

MLLM VLM

522

981

25 Apr 2024

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

...

459

19 Apr 2024

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Neural Information Processing Systems (NeurIPS), 2024

...

Dahua Lin

267

159

09 Apr 2024

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

406

325

27 Mar 2024