Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts

24 February 2025
Zhenghao Liu
Xingsheng Zhu
Tianshuo Zhou
Xinyi Zhang
Xiaoyuan Yi
Yukun Yan
Yu Gu
Ge Yu
Maosong Sun
    RALM
    VLM
Abstract

This paper introduces Multi-Modal Retrieval-Augmented Generation (M^2RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs) in leveraging knowledge from multi-modal retrieval documents. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. All tasks are set in an open-domain setting, requiring RAG models to retrieve query-relevant information from a multi-modal document collection and use it as input context for RAG modeling. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT), an instruction tuning method that optimizes MLLMs within multi-modal contexts. Our experiments show that MM-RAIT improves the performance of RAG systems by enabling them to effectively learn from multi-modal contexts. All data and code are available at this https URL.
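The open-domain setup described in the abstract — retrieve query-relevant multi-modal documents, then pass them to an MLLM as input context — can be sketched as follows. This is a minimal illustration under assumed interfaces, not the released M^2RAG code: the Doc, score, retrieve, and build_prompt names are hypothetical, and the toy lexical scorer stands in for a real multi-modal retriever.

# Minimal sketch of an open-domain multi-modal RAG pipeline (illustrative only).
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    image_path: str | None = None  # optional image accompanying the text

def score(query: str, doc: Doc) -> float:
    """Toy lexical-overlap score standing in for a real multi-modal retriever."""
    q_tokens, d_tokens = set(query.lower().split()), set(doc.text.lower().split())
    return len(q_tokens & d_tokens) / (len(q_tokens) or 1)

def retrieve(query: str, collection: list[Doc], k: int = 3) -> list[Doc]:
    """Return the top-k query-relevant documents from the collection."""
    return sorted(collection, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[Doc]) -> str:
    """Interleave retrieved text and image placeholders as context for an MLLM."""
    parts = []
    for i, d in enumerate(docs, 1):
        img = f" [image: {d.image_path}]" if d.image_path else ""
        parts.append(f"Document {i}:{img} {d.text}")
    context = "\n".join(parts)
    return f"{context}\n\nQuestion: {query}\nAnswer using the documents above."

if __name__ == "__main__":
    collection = [
        Doc("A red panda resting on a tree branch.", "red_panda.jpg"),
        Doc("The Eiffel Tower illuminated at night.", "eiffel.jpg"),
        Doc("Giant pandas eat mostly bamboo."),
    ]
    query = "What do giant pandas eat?"
    print(build_prompt(query, retrieve(query, collection)))

In the benchmark's framing, the prompt built this way would be given to the MLLM (and, after MM-RAIT, the model is instruction-tuned to make better use of such retrieved multi-modal context); the retrieval and generation components themselves are whatever the evaluated system provides.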

@article{liu2025_2502.17297,
  title={Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts},
  author={Zhenghao Liu and Xingsheng Zhu and Tianshuo Zhou and Xinyi Zhang and Xiaoyuan Yi and Yukun Yan and Yu Gu and Ge Yu and Maosong Sun},
  journal={arXiv preprint arXiv:2502.17297},
  year={2025}
}