
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

23 February 2023


Papers citing "Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?"

50 / 58 papers shown

Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering

170

28 Nov 2025

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

261

27 Nov 2025

SciEGQA: A Dataset for Scientific Evidence-Grounded Question Answering and Reasoning

222

19 Nov 2025

HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation

276

19 Nov 2025

DeepEyesV2: Toward Agentic Multimodal Model

IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2025

195

07 Nov 2025

Unified Reinforcement and Imitation Learning for Vision-Language Models

225

22 Oct 2025

Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents

177

21 Oct 2025

A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications

634

19 Oct 2025

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

263

16 Oct 2025

NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

314

15 Oct 2025

DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

306

14 Oct 2025

CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation

...

166

10 Oct 2025

MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval

184

10 Oct 2025

Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval

232

03 Oct 2025

Generalized Contrastive Learning for Universal Multimodal Retrieval

234

30 Sep 2025

From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

...

634

29 Sep 2025

Recurrence Meets Transformers for Universal Multimodal Retrieval

280

10 Sep 2025

Global-to-Local or Local-to-Global? Enhancing Image Retrieval with Efficient Local Search and Effective Global Re-ranking

248

04 Sep 2025

CMRAG: Co-modality-based visual document retrieval and question answering

315

02 Sep 2025

Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering

359

31 Aug 2025

mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering

244

07 Aug 2025

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

...

399

07 Aug 2025

On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

370

28 Jul 2025

Augmented Vision-Language Models: A Systematic Review

222

24 Jul 2025

Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown

412

21 Jun 2025

GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

382

18 Jun 2025

CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Yang Tian

Fan Liu

Jingyuan Zhang

Victoria A. Webster-Wood

Yupeng Hu

Liqiang Nie

VLM

303

03 Jun 2025

mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation

450

29 May 2025

Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation

367

28 May 2025

Spa-VLM: Stealthy Poisoning Attacks on RAG-based VLM

298

28 May 2025

MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning

Prasham Yatinkumar Titiya

282

27 May 2025

OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

445

10 May 2025

MIEB: Massive Image Embedding Benchmark

583

14 Apr 2025

Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

418

23 Mar 2025

Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation

641

08 Mar 2025

Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering

440

28 Feb 2025

Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up

1.0K

27 Feb 2025

Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference

517

25 Feb 2025

Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries

335

23 Feb 2025

Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines

AAAI Conference on Artificial Intelligence (AAAI), 2025

418

23 Feb 2025

LOVA3: Learning to Visual Question Answering, Asking and Assessment

Neural Information Processing Systems (NeurIPS), 2024

458

21 Feb 2025

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Mohammad Mahdi Abootorabi

Amirhosein Zobeiri

Mahdi Dehghani

Mohammadali Mohammadkhani

841

12 Feb 2025

Generative Landmarks Guided Eyeglasses Removal 3D Face Reconstruction

Dapeng Zhao

Yue Qi

3DH CVBM 3DV

373

31 Dec 2024

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Meishan Zhang

623

112

22 Dec 2024

Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

549

18 Dec 2024

Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

International Conference on Learning Representations (ICLR), 2024

...

778

05 Nov 2024

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

International Conference on Learning Representations (ICLR), 2024

1.0K

106

04 Nov 2024

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

International Conference on Learning Representations (ICLR), 2024

Pan Lu

Kai-Wei Chang

Nanyun Peng

VLM

399

10 Oct 2024

EchoSight: Advancing Visual-Language Models with Wiki Knowledge

Yibin Yan

Weidi Xie

RALM

380

17 Jul 2024

SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

394

28 Jun 2024