
Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction

International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025
Main: 8 pages · 1 figure · 3 tables · Bibliography: 2 pages
Abstract

As medical diagnoses increasingly draw on multimodal data, machine learning models are expected to fuse heterogeneous information effectively while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modality dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. Our approach introduces learnable modality tokens to improve missingness-aware fusion and augments conventional unimodal contrastive objectives with fused multimodal representations. We validate our framework on large-scale clinical datasets for disease detection and prediction tasks spanning visual and tabular modalities. Experimental results demonstrate that our method achieves state-of-the-art performance, particularly in the challenging and practical scenario where only a single modality is available at inference. Furthermore, we show its adaptability through successful integration with a recent CT foundation model. Our findings highlight the effectiveness, efficiency, and generalizability of our approach for multimodal learning, offering a scalable, low-cost solution with significant potential for real-world clinical applications. The code is available at this https URL.
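The sketch below illustrates, in PyTorch, one plausible reading of the two mechanisms the abstract describes: modality dropout in which a dropped (or missing) modality's embedding is replaced by a learnable modality token, and an InfoNCE-style contrastive objective that aligns each unimodal embedding with the fused multimodal representation. All names, the two-modality (image/tabular) setup, the MLP fusion head, and the exact loss form are illustrative assumptions, not the paper's released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenModalityDropoutFusion(nn.Module):
    """Fuses image and tabular embeddings; missing or dropped modalities are
    replaced by learnable tokens. Names and the fusion head are assumptions."""

    def __init__(self, dim: int = 256, p_drop: float = 0.3):
        super().__init__()
        self.image_token = nn.Parameter(torch.randn(dim))    # stand-in for a missing image
        self.tabular_token = nn.Parameter(torch.randn(dim))  # stand-in for missing tabular data
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.p_drop = p_drop

    def forward(self, z_img, z_tab, img_present=None, tab_present=None):
        B = z_img.size(0)
        if img_present is None:
            img_present = torch.ones(B, dtype=torch.bool, device=z_img.device)
        if tab_present is None:
            tab_present = torch.ones(B, dtype=torch.bool, device=z_tab.device)
        if self.training:
            # Modality dropout: randomly mark modalities as missing, but never
            # remove a sample's last remaining modality.
            drop_img = (torch.rand(B, device=z_img.device) < self.p_drop) & tab_present
            img_present = img_present & ~drop_img
            drop_tab = (torch.rand(B, device=z_tab.device) < self.p_drop) & img_present
            tab_present = tab_present & ~drop_tab
        # Replace absent embeddings with the corresponding learnable token.
        z_img = torch.where(img_present[:, None], z_img, self.image_token.expand(B, -1))
        z_tab = torch.where(tab_present[:, None], z_tab, self.tabular_token.expand(B, -1))
        z_fused = self.fuse(torch.cat([z_img, z_tab], dim=-1))
        return z_img, z_tab, z_fused

def fused_contrastive_loss(z_img, z_tab, z_fused, tau: float = 0.1):
    """InfoNCE aligning each unimodal embedding with the fused representation
    of the same subject (positives on the diagonal of the similarity matrix)."""
    targets = torch.arange(z_fused.size(0), device=z_fused.device)
    anchor = F.normalize(z_fused, dim=-1)
    loss = 0.0
    for z in (z_img, z_tab):
        logits = F.normalize(z, dim=-1) @ anchor.T / tau
        loss = loss + F.cross_entropy(logits, targets)
    return loss / 2

# Usage with embeddings from any pair of encoders:
#   model = TokenModalityDropoutFusion(dim=256)
#   z_i, z_t, z_f = model(img_emb, tab_emb, img_present=mask_i, tab_present=mask_t)
#   loss = fused_contrastive_loss(z_i, z_t, z_f) + task_loss

Substituting a learned token rather than zeroing out a missing embedding gives the fusion network an explicit, trainable signal of absence, which is what makes single-modality inference well-defined in this kind of design.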
