MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

10 October 2024
Wenbo Hu
Jia-Chen Gu
Zi-Yi Dou
Mohsen Fayyaz
Pan Lu
Kai-Wei Chang
Nanyun Peng
Abstract

Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, in some scenarios retrieving visual information is either more beneficial than textual data or easier to obtain. In this paper, we introduce MRAG-Bench, a multimodal retrieval-augmented generation benchmark in which we systematically identify and categorize scenarios where visually augmented knowledge is more useful than textual knowledge, for instance, additional images of an entity from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we evaluate 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs improve more when augmented with images than with textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, our extensive analysis with MRAG-Bench offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, struggles to leverage retrieved knowledge effectively, improving by only 5.82% with ground-truth information, in contrast to the 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs' ability to utilize retrieved visual knowledge more effectively.
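To make the evaluation protocol concrete, below is a minimal sketch of the with/without-retrieval comparison the abstract describes. The item structure, the answer_fn callable, and every name here are hypothetical illustrations under the assumption that each benchmark item pairs a query image and a multiple-choice question with a set of retrieved images; this is not the authors' released code.

# Hypothetical sketch of the MRAG-Bench evaluation protocol described in the
# abstract: score a model on multiple-choice questions with and without
# retrieved visual knowledge, then report the improvement from augmentation.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class MRAGItem:  # assumed item layout, for illustration only
    question: str
    choices: List[str]           # multiple-choice options
    answer: str                  # ground-truth choice label
    query_image: str             # path to the image the question asks about
    retrieved_images: List[str] = field(default_factory=list)  # visual knowledge

def accuracy(items: List[MRAGItem],
             answer_fn: Callable[[str, List[str], List[str]], str],
             use_retrieval: bool) -> float:
    """Fraction of items answered correctly.

    answer_fn(question, choices, images) -> predicted choice label; it wraps
    whatever LVLM is under evaluation.
    """
    correct = 0
    for item in items:
        images = [item.query_image]
        if use_retrieval:
            images += item.retrieved_images  # augment with retrieved views
        if answer_fn(item.question, item.choices, images) == item.answer:
            correct += 1
    return correct / len(items)

def vision_centric_gain(items, answer_fn) -> float:
    """Improvement from visual augmentation, the quantity the paper reports
    (e.g. the 5.82% gain of GPT-4o with ground-truth visual knowledge)."""
    return accuracy(items, answer_fn, True) - accuracy(items, answer_fn, False)

Plugging in an answer_fn that calls an open-source or proprietary LVLM, and running it once with the benchmark's retrieved images and once without, reproduces the with/without comparison that the abstract uses to call the benchmark vision-centric.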

View on arXiv: https://arxiv.org/abs/2410.08182
@article{hu2025_2410.08182,
  title={MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models},
  author={Wenbo Hu and Jia-Chen Gu and Zi-Yi Dou and Mohsen Fayyaz and Pan Lu and Kai-Wei Chang and Nanyun Peng},
  journal={arXiv preprint arXiv:2410.08182},
  year={2025}
}