UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation

20 March 2025
Yaxiong Chen, Chuang Du, Chunlei Li, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Abstract

Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data. For example, datasets for this task are much smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while keeping the base parameters fixed. The adapters are distributed across modalities and their interaction to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing the state of the art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at this https URL.
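The abstract describes the core mechanism: a frozen CLIP backbone plus lightweight, trainable adapters attached to each modality and to their interaction. Below is a minimal PyTorch sketch of that general pattern. The module names, bottleneck size, attention-based cross-modal adapter, feature dimensions, and the use of the openai/CLIP package are assumptions for illustration only, not the authors' released implementation (see the linked code for that).

import torch
import torch.nn as nn
import clip  # openai/CLIP package, assumed available

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual connection."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapter starts as an identity mapping.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class CrossModalAdapter(nn.Module):
    """Illustrative cross-modal adapter: text tokens attend to image tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        attended, _ = self.attn(text_tokens, image_tokens, image_tokens)
        return self.norm(text_tokens + attended)

# Freeze the pre-trained CLIP backbone so only adapter parameters receive gradients.
model, _ = clip.load("ViT-B/16")
for p in model.parameters():
    p.requires_grad = False

# One adapter per modality plus one for their interaction (dims are illustrative).
vision_adapter = Adapter(dim=768)
text_adapter = Adapter(dim=512)
cross_adapter = CrossModalAdapter(dim=512)

trainable = [p for m in (vision_adapter, text_adapter, cross_adapter)
             for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

Zero-initializing the adapter's up-projection makes the adapted model behave exactly like frozen CLIP at the start of fine-tuning, a common trick that stabilizes training when only a small fraction of parameters is updated.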

@article{chen2025_2503.15940,
  title={UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation},
  author={Yaxiong Chen and Chuang Du and Chunlei Li and Jingliang Hu and Yilei Shi and Shengwu Xiong and Xiao Xiang Zhu and Lichao Mou},
  journal={arXiv preprint arXiv:2503.15940},
  year={2025}
}