On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering

Computer Vision and Pattern Recognition (CVPR), 2020

24 February 2020

Xinyu Wang

Yuliang Liu

Chunhua Shen

Lianwen Jin

Papers citing "On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering"

LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

180

25 Nov 2025

MGA-VQA: Secure and Interpretable Graph-Augmented Visual Question Answering with Memory-Guided Protection Against Unauthorized Knowledge Use

Ahmad Mohammadshirazi

Pinaki Prasad Guha Neogi

Dheeraj Kulshrestha

R. Ramnath

148

22 Nov 2025

NVIDIA Nemotron Nano V2 VL

Nvidia

Amala Sanjay Deshmukh

...

412

06 Nov 2025

FineVision: Open Data Is All You Need

Aritra Roy Gosthipaty

Andrés Marafioti

VLM

248

20 Oct 2025

Vision Language Models Are Not (Yet) Spelling Correctors

Junhong Liang

Bojun Zhang

VLM

104

22 Sep 2025

MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs

466

15 Sep 2025

Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs

Somraj Gautam

Abhirama Subramanyam Penamakuri

Abhishek Bhandari

Gaurav Harit

LMTD LRM

340

24 Aug 2025

ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

184

11 Aug 2025

Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective

236

06 Aug 2025

MLLM-CTBench: A Benchmark for Continual Instruction Tuning with Reasoning Process Diagnosis

196

31 Jul 2025

SATORI-R1: Incentivizing Multimodal Reasoning through Explicit Visual Anchoring

550

25 May 2025

One RL to See Them All: Visual Triple Unified Reinforcement Learning

526

23 May 2025

PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

Ijazul Haq

Yingjie Zhang

Irfan Ali Khan

386

15 May 2025

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

...

290

08 May 2025

Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models

391

10 Apr 2025

Data Metabolism: An Efficient Data Design Schema For Vision Language Model

410

10 Apr 2025

Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

326

18 Mar 2025

PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks

484

06 Mar 2025

Are Large Vision Language Models Good Game Players?

International Conference on Learning Representations (ICLR), 2025

311

04 Mar 2025

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Computer Vision and Pattern Recognition (CVPR), 2024

...

598

20 Dec 2024

LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer

...

420

18 Dec 2024

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Computer Vision and Pattern Recognition (CVPR), 2024

295

12 Dec 2024

DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness

Ahmad Mohammadshirazi

Pinaki Prasad Guha Neogi

Ser-Nam Lim

R. Ramnath

510

29 Nov 2024

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Computer Vision and Pattern Recognition (CVPR), 2024

...

241

16 Nov 2024

SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset

Ngoc Dung Huynh

Mohamed Reda Bouadjenek

Sunil Aryal

Imran Razzak

Hakim Hacid

263

30 Oct 2024

NVLM: Open Frontier-Class Multimodal LLMs

Wenliang Dai

Zihan Liu

363

127

17 Sep 2024

ACTRESS: Active Retraining for Semi-supervised Visual Grounding

Weitai Kang

Mengxue Qu

Yunchao Wei

Yan Yan

382

03 Jul 2024

Visual Grounding with Attention-Driven Constraint Balancing

Weitai Kang

330

03 Jul 2024

SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

Weitai Kang

Gaowen Liu

Mubarak Shah

Yan Yan

ObjD

475

03 Jul 2024

Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis

241

21 Jun 2024

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Zhe Chen

...

Dahua Lin

Yu Qiao

Botian Shi

Conghui He

Jifeng Dai

VLM OffRL

349

12 Jun 2024

VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text

International Conference on Learning Representations (ICLR), 2024

Tianyu Zhang

Ge Zhang

347

10 Jun 2024

The Evolution of Multimodal Model Architectures

407

28 May 2024

Exploring the Capabilities of Large Multimodal Models on Dense Text

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2024

Yuliang Liu

242

09 May 2024

Instruction Makes a Difference

360

01 Feb 2024

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

...

Yu Qiao

314

18 Jan 2024

ModaVerse: Efficiently Transforming Modalities with LLMs

Computer Vision and Pattern Recognition (CVPR), 2024

Xinyu Wang

Bohan Zhuang

Qi Wu

290

12 Jan 2024

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Weijie Su

...

Ping Luo

Yu Qiao

795

2,644

21 Dec 2023

Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering

228

20 Dec 2023

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Computer Vision and Pattern Recognition (CVPR), 2023

Yuliang Liu

626

423

11 Nov 2023

Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA

Jia Li

340

13 Oct 2023

Separate and Locate: Rethink the Text in Text-based Visual Question Answering

ACM Multimedia (ACM MM), 2023

369

31 Aug 2023

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

AAAI Conference on Artificial Intelligence (AAAI), 2023

478

207

19 Aug 2023

Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Computer Vision and Pattern Recognition (CVPR), 2023

Xiang Wan

258

21 Jul 2023

On the Hidden Mystery of OCR in Large Multimodal Models

Science China Information Sciences (Sci China Inf Sci), 2023

Yuliang Liu

Lianwen Jin

513

117

13 May 2023

MPMQA: Multimodal Question Answering on Product Manuals

AAAI Conference on Artificial Intelligence (AAAI), 2023

Liangfu Zhang

Anwen Hu

Jing Zhang

Shuo Hu

Qin Jin

236

19 Apr 2023

PDFVQA: A New Dataset for Real-World VQA on PDF Documents

492

13 Apr 2023

Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning

256

17 Dec 2022

Hierarchical multimodal transformers for Multi-Page DocVQA

Pattern Recognition (Pattern Recogn.), 2022

Rubèn Pérez Tito

Dimosthenis Karatzas

Ernest Valveny

300

103

07 Dec 2022

VLG: General Video Recognition with Web Textual Knowledge

International Journal of Computer Vision (IJCV), 2022

380

03 Dec 2022