Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis

Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed on classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. The self-supervised RAD-DINO consistently excelled at segmentation, while the text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly on the challenging pneumothorax segmentation task. The findings highlight that pre-training methodology strongly influences performance on specific downstream tasks: models trained without text supervision performed better on fine-grained segmentation, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models for specific clinical applications in radiology.
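The abstract does not specify how the custom segmentation model combines global and local features, so the sketch below is only one plausible illustration, not the authors' implementation: a small PyTorch head that broadcasts a frozen ViT-style backbone's global [CLS] embedding over the spatial grid of its patch embeddings, fuses the two, and decodes a full-resolution mask. The class name, dimensions, and decoder design are assumptions made for the example.

# Hedged sketch (not the paper's released code): fusing a global image
# embedding with local patch embeddings for binary mask prediction
# (e.g., pneumothorax). Backbone choice, dimensions, and decoder are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalSegHead(nn.Module):
    """Fuses a global embedding with per-patch embeddings, then upsamples
    the fused feature map to a full-resolution segmentation logit map."""

    def __init__(self, embed_dim: int = 768, patch_grid: int = 14, num_classes: int = 1):
        super().__init__()
        self.patch_grid = patch_grid
        # Project the concatenated (global + local) features back to embed_dim.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * embed_dim, embed_dim, kernel_size=1),
            nn.GELU(),
        )
        # Lightweight convolutional decoder producing per-pixel logits.
        self.decoder = nn.Sequential(
            nn.Conv2d(embed_dim, 256, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(256, num_classes, kernel_size=1),
        )

    def forward(self, cls_token: torch.Tensor, patch_tokens: torch.Tensor,
                out_size: int = 224) -> torch.Tensor:
        # cls_token:    (B, D)     global summary of the radiograph
        # patch_tokens: (B, N, D)  local features, N = patch_grid ** 2
        B, N, D = patch_tokens.shape
        local = patch_tokens.transpose(1, 2).reshape(B, D, self.patch_grid, self.patch_grid)
        # Broadcast the global token over the spatial grid and concatenate.
        global_map = cls_token[:, :, None, None].expand(-1, -1, self.patch_grid, self.patch_grid)
        fused = self.fuse(torch.cat([global_map, local], dim=1))
        logits = self.decoder(fused)
        # Upsample coarse patch-level logits to the input resolution.
        return F.interpolate(logits, size=(out_size, out_size), mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    head = GlobalLocalSegHead()
    cls_tok = torch.randn(2, 768)           # stand-in for a frozen backbone's CLS output
    patches = torch.randn(2, 14 * 14, 768)  # stand-in for its patch embeddings
    masks = head(cls_tok, patches)
    print(masks.shape)  # torch.Size([2, 1, 224, 224])

Broadcasting the global token gives every spatial location access to image-level context, which is one reason a global-plus-local design could help on findings like pneumothorax, where small local cues must be read against the overall appearance of the lung. In a frozen-feature comparison of backbones such as RAD-DINO, CheXagent's vision encoder, or BiomedCLIP's image tower, only a head of this kind would be trained.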
@article{li2025_2504.16047,
  title   = {Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis},
  author  = {Frank Li and Hari Trivedi and Bardia Khosravi and Theo Dapamede and Mohammadreza Chavoshi and Abdulhameed Dere and Rohan Satya Isaac and Aawez Mansuri and Janice Newsome and Saptarshi Purkayastha and Judy Gichoya},
  journal = {arXiv preprint arXiv:2504.16047},
  year    = {2025}
}