ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2502.20964
21
0

Fine-Grained Retrieval-Augmented Generation for Visual Question Answering

28 February 2025
Zhengxuan Zhang
Yin Wu
Yuyu Luo
Nan Tang
ArXivPDFHTML
Abstract

Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Although cutting-edge multimodal large language models (MLLMs) such as GPT-4o achieve strong performance on VQA tasks, they frequently fall short in accessing domain-specific or the latest knowledge. To mitigate this issue, retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs), referred to as KB-VQA, emerges as a promising approach. Nevertheless, conventional unimodal retrieval techniques, which translate images into textual descriptions, often result in the loss of critical visual details. This study presents fine-grained knowledge units, which merge textual snippets with entity images stored in vector databases. Furthermore, we introduce a knowledge unit retrieval-augmented generation framework (KU-RAG) that integrates fine-grained retrieval with MLLMs. The proposed KU-RAG framework ensures precise retrieval of relevant knowledge and enhances reasoning capabilities through a knowledge correction chain. Experimental findings demonstrate that our approach significantly boosts the performance of leading KB-VQA methods, achieving an average improvement of approximately 3% and up to 11% in the best case.

View on arXiv
@article{zhang2025_2502.20964,
  title={ Fine-Grained Retrieval-Augmented Generation for Visual Question Answering },
  author={ Zhengxuan Zhang and Yin Wu and Yuyu Luo and Nan Tang },
  journal={arXiv preprint arXiv:2502.20964},
  year={ 2025 }
}
Comments on this paper