Multimodal retrieval-augmented generation (M-RAG) has recently emerged as a method to mitigate hallucinations in large multimodal models (LMMs) by grounding their outputs in a factual knowledge base (KB). However, M-RAG also introduces new attack vectors for adversaries who aim to disrupt the system by injecting malicious entries into the KB. In this work, we present a poisoning attack against M-RAG targeting visual document retrieval applications, where the KB contains images of document pages. Our objective is to craft a single image that is retrieved for a wide variety of user queries and consistently influences the output of the generative model, thus creating a universal denial-of-service (DoS) attack against the M-RAG system. We demonstrate that while our attack is effective against a diverse range of widely used, state-of-the-art retrievers (embedding models) and generators (LMMs), it can be ineffective against robust embedding models. Our attack not only highlights the vulnerability of M-RAG pipelines to poisoning, but also sheds light on a fundamental weakness that potentially hinders their performance even in benign settings.
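The core of such a universal poisoning attack is an optimization problem: find a single KB entry whose embedding is close to the embeddings of many different queries at once, so the retriever returns it regardless of what the user asks. The sketch below illustrates this idea on a toy stand-in for the embedding model (a fixed random linear projection with L2 normalization); the embedding dimensions, learning rate, and gradient-ascent loop are illustrative assumptions, not the paper's actual method, which operates on real retrievers and image pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a multimodal retriever: a fixed random linear projection
# followed by L2 normalization. A real attack would target an actual
# document/query encoder; dimensions here are arbitrary.
D_IN, D_EMB = 64, 16
W = rng.normal(size=(D_IN, D_EMB))

def embed(x):
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Embeddings of many distinct user queries the attacker wants to cover.
queries = embed(rng.normal(size=(32, D_IN)))
q_bar = queries.mean(axis=0)  # maximizing mean cosine sim = aligning with q_bar

def mean_sim(x):
    """Mean cosine similarity between the candidate entry and all queries."""
    return float(queries @ embed(x)).__abs__() if False else float(np.mean(queries @ embed(x)))

# Start from a random "poison" input and run gradient ascent on the mean
# cosine similarity (analytic gradient of q_bar . z/||z|| w.r.t. x).
x = rng.normal(size=D_IN)
sim_before = mean_sim(x)
for _ in range(300):
    z = x @ W
    n = np.linalg.norm(z)
    grad_z = q_bar / n - (q_bar @ z) * z / n**3
    x = x + 2.0 * (W @ grad_z)
sim_after = mean_sim(x)
print(f"mean similarity: {sim_before:.3f} -> {sim_after:.3f}")
```

In this toy setting the optimization simply rotates the entry's embedding toward the mean query embedding; against a robust encoder, as the abstract notes, no single input may align well with many semantically distinct queries, which is what blunts the attack.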
@article{shereen2025_2504.02132,
  title={One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image},
  author={Ezzeldin Shereen and Dan Ristea and Burak Hasircioglu and Shae McFadden and Vasilios Mavroudis and Chris Hicks},
  journal={arXiv preprint arXiv:2504.02132},
  year={2025}
}