
Can language-guided unsupervised adaptation improve medical image classification using unpaired images and texts?

IEEE International Symposium on Biomedical Imaging (ISBI), 2024
Main: 4 pages · Bibliography: 1 page · 5 figures · 3 tables
Abstract

In medical image classification, supervised learning is challenging due to the scarcity of labeled medical images. To address this, we leverage the visual-textual alignment within Vision-Language Models (VLMs) to enable unsupervised learning of a medical image classifier. In this work, we propose Medical Unsupervised Adaptation (MedUnA) of VLMs, where LLM-generated descriptions for each class are encoded into text embeddings and matched with class labels via a cross-modal adapter. This adapter attaches to the visual encoder of MedCLIP and aligns the visual embeddings through unsupervised learning, driven by a contrastive entropy-based loss and prompt tuning. This improves performance in scenarios where textual information is more abundant than labeled images, as is common in the healthcare domain. Unlike traditional VLMs, MedUnA uses unpaired images and text to learn representations, extending the potential of VLMs beyond traditional constraints. We evaluate performance on three chest X-ray datasets and two multi-class datasets (diabetic retinopathy and skin lesions), showing significant accuracy gains over the zero-shot baseline. Our code is available at this https URL.
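To make the adaptation idea concrete, the following is a minimal sketch of the mechanism the abstract describes: a lightweight adapter on top of a frozen visual encoder, trained without image labels by matching visual embeddings against text embeddings of LLM-generated class descriptions under an entropy-based objective. All names, shapes, and the temperature value here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Hypothetical bottleneck MLP mapping frozen visual features toward
    the text-embedding space, with a residual connection."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual keeps adapted features close to the original embeddings.
        return F.normalize(x + self.net(x), dim=-1)

def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the image-to-class-text similarity distribution;
    minimizing it encourages confident assignments on unlabeled images."""
    probs = logits.softmax(dim=-1)
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

# Toy training step (random tensors stand in for real encoder outputs).
num_classes, dim, batch = 5, 512, 8
text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)   # LLM class-description embeddings
visual_emb = F.normalize(torch.randn(batch, dim), dim=-1)       # frozen visual-encoder features

adapter = Adapter(dim)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

logits = adapter(visual_emb) @ text_emb.t() / 0.07  # temperature is an assumed value
loss = entropy_loss(logits)
opt.zero_grad()
loss.backward()
opt.step()
print(f"entropy loss: {loss.item():.4f}")
```

In this sketch only the adapter is updated; the visual and text encoders stay frozen, which mirrors the abstract's emphasis on lightweight, label-free adaptation rather than full fine-tuning.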
