Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance on visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify a key cause of these hallucinations: the model's over-susceptibility to specific image frequency features when detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.
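To make the idea of frequency-domain perturbation concrete, the following is a minimal, hypothetical sketch of how an image (or feature map) can be split into low- and high-frequency bands and each band scaled independently. The function name, cutoff, and scaling parameters are illustrative assumptions; the paper's actual MFP formulation operates on visual feature representations inside the MLLM and is not reproduced here.

```python
import numpy as np

def multi_frequency_perturb(image, cutoff=0.25, low_scale=0.9, high_scale=0.9):
    """Illustrative sketch (not the paper's exact method): damp the low- and
    high-frequency components of a 2-D array by separate scale factors."""
    # 2-D FFT with the zero-frequency component shifted to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalized radial distance of each frequency bin from the center.
    radius = np.sqrt(((yy - h / 2) / h) ** 2 + ((xx - w / 2) / w) ** 2)
    low_mask = radius <= cutoff   # low-frequency band
    high_mask = ~low_mask         # high-frequency band
    # Scale each band, then invert the transform back to the spatial domain.
    perturbed = spectrum * (low_scale * low_mask + high_scale * high_mask)
    return np.real(np.fft.ifft2(np.fft.ifftshift(perturbed)))
```

With both scales set to 1.0 the input is recovered unchanged; lowering either scale selectively suppresses that frequency band before the features are consumed downstream.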
@article{li2025_2503.14895,
  title={Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations},
  author={Shuo Li and Jiajun Sun and Guodong Zheng and Xiaoran Fan and Yujiong Shen and Yi Lu and Zhiheng Xi and Yuming Yang and Wenming Tan and Tao Ji and Tao Gui and Qi Zhang and Xuanjing Huang},
  journal={arXiv preprint arXiv:2503.14895},
  year={2025}
}