Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making

15 December 2025

Siyuan Dai

Lunxiao Li

Kun Zhao

Eardi Lila

Paul K. Crane

Heng Huang

Dongkuan Xu

Haoteng Tang

Liang Zhan


Main: 3 pages, 1 figure. Bibliography: 2 pages.

Abstract

With the rapid progress of large language models (LLMs), advanced multimodal large language models (MLLMs) have demonstrated impressive zero-shot capabilities on vision-language tasks. In the biomedical domain, however, even state-of-the-art MLLMs struggle with basic Medical Decision Making (MDM) tasks. We investigate this limitation using two challenging datasets: (1) three-stage Alzheimer's disease (AD) classification (normal, mild cognitive impairment, dementia), where category differences are visually subtle, and (2) MIMIC-CXR chest radiograph classification with 14 non-mutually exclusive conditions. Our empirical study shows that text-only reasoning consistently outperforms vision-only or vision-text settings, with multimodal inputs often performing worse than text alone. To mitigate this, we explore three strategies: (1) in-context learning with reason-annotated exemplars, (2) vision captioning followed by text-only inference, and (3) few-shot fine-tuning of the vision tower with classification supervision. These findings reveal that current MLLMs lack grounded visual understanding and point to promising directions for improving multimodal decision making in healthcare.
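As an illustration of strategy (2), vision captioning followed by text-only inference, the following is a minimal sketch and not the authors' implementation. The callables `caption_fn` and `text_llm_fn` are hypothetical stand-ins for an MLLM used as a captioner and a text-only LLM. The label list assumes the 14 CheXpert-style labels distributed with MIMIC-CXR; the abstract does not enumerate them. The prompt wording and the comma-separated answer format are also assumptions.

```python
from typing import Any, Callable, Dict, List

# Assumption: the 14 CheXpert-style labels shipped with MIMIC-CXR.
# The abstract only says "14 non-mutually exclusive conditions".
LABELS: List[str] = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax", "Support Devices",
]


def build_prompt(caption: str) -> str:
    """Build a text-only prompt from an image caption (hypothetical wording)."""
    return (
        "A chest radiograph is described as follows:\n"
        f"{caption}\n\n"
        "List every condition that applies, separated by commas, "
        "choosing only from: " + ", ".join(LABELS) + "."
    )


def parse_labels(answer: str) -> Dict[str, int]:
    """Map a comma-separated answer to a multi-label 0/1 dictionary."""
    named = {part.strip().lower() for part in answer.split(",")}
    return {label: int(label.lower() in named) for label in LABELS}


def caption_then_classify(
    image: Any,
    caption_fn: Callable[[Any], str],
    text_llm_fn: Callable[[str], str],
) -> Dict[str, int]:
    """Stage 1: describe the image in text. Stage 2: classify from text only."""
    caption = caption_fn(image)
    answer = text_llm_fn(build_prompt(caption))
    return parse_labels(answer)
```

Because the labels are not mutually exclusive, the output is one independent 0/1 decision per condition rather than a single class. Any real captioner or LLM client can be passed in place of the two callables.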
