Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding

Zahra Mahdavi
Zahra Khodakaramimaghsoud
Hooman Khaloo
Sina Bakhshandeh Taleshani
Erfan Hashemi
Javad Mirzapour Kaleybar
Omid Nejati Manzari
Main: 13 pages · 7 figures · 15 tables · Bibliography: 5 pages
Abstract

Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet these models remain vulnerable to hallucination: outputs that appear plausible but are in fact incorrect. In the natural image domain, several decoding strategies have been proposed to mitigate hallucinations by reinforcing visual evidence, but most rely on secondary decoding or rollback procedures that substantially slow inference. Moreover, existing solutions are often domain-specific and may introduce misalignment between modalities or between generated and ground-truth content. We introduce Med-VCD, a sparse visual-contrastive decoding method that mitigates hallucinations in medical LVLMs without the time overhead of secondary decoding. Med-VCD incorporates a novel token-sparsification strategy that selects visually informed tokens on the fly, trimming redundancy while retaining critical visual context and thus balancing efficiency with reliability. Evaluations on eight medical datasets, spanning ophthalmology, radiology, and pathology tasks in visual question answering, report generation, and dedicated hallucination benchmarks, show that Med-VCD raises factual accuracy by an average of 13% and improves hallucination accuracy by 6% relative to baseline medical LVLMs.
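To make the two ingredients concrete, the sketch below shows the general visual contrastive decoding recipe that methods of this kind build on, paired with a simple top-k visual-token sparsifier. The model interface (`logits_fn`), the attention-based scoring, the distorted-image branch, and all hyperparameter values (`keep_ratio`, `alpha`) are illustrative assumptions, not Med-VCD's exact formulation, which is specified in the paper itself.

```python
# Hedged sketch: generic visual contrastive decoding + token sparsification.
# All names and defaults here are hypothetical, not the paper's API.
import torch

def sparsify_visual_tokens(visual_tokens: torch.Tensor,
                           scores: torch.Tensor,
                           keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top-scoring visual tokens (scored, e.g., by attention mass),
    trimming redundant tokens while retaining the most informative context.
    visual_tokens: (num_tokens, dim); scores: (num_tokens,)."""
    k = max(1, int(visual_tokens.size(0) * keep_ratio))
    top_idx = torch.topk(scores, k).indices.sort().values  # keep spatial order
    return visual_tokens[top_idx]

def contrastive_logits(logits_fn, text_ids, visual_tokens, distorted_tokens,
                       alpha: float = 1.0) -> torch.Tensor:
    """One contrastive decoding step: amplify next-token evidence supported by
    the clean image and suppress what the model would guess from a distorted
    copy, i.e. (1 + alpha) * logits_clean - alpha * logits_distorted."""
    logits_clean = logits_fn(text_ids, visual_tokens)
    logits_distorted = logits_fn(text_ids, distorted_tokens)
    return (1 + alpha) * logits_clean - alpha * logits_distorted
```

At generation time, each step would sample (or argmax) the next token from `contrastive_logits` instead of the raw model logits; the sparsification runs once per image, which is how this family of methods avoids the cost of a full secondary decoding pass.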
