CardiacCLIP: Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner

International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025

21 September 2025

Yao Du

Jiarong Guo

Xiaomeng Li


Main:8 Pages

1 Figure

Bibliography:3 Pages

8 Tables

Abstract

Echocardiography is a vital non-invasive modality for cardiac assessment, with left ventricular ejection fraction (LVEF) serving as a key indicator of heart function. Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly to build and limit adaptability across clinical settings. Recent vision-language models for echocardiography, such as EchoCLIP, apply image-to-text pretraining but fail to capture the temporal dynamics and localized cardiac structures essential for accurate diagnosis. To address these challenges, we propose CardiacCLIP, a video-based framework that enhances LVEF prediction through attention-based frame aggregation and multi-resolution input scaling. Specifically, we introduce MFL (Multi Frame Learning), a novel attention-based mechanism for selectively fusing informative frames, and EchoZoom, a multi-scale feature extraction strategy that refines spatial representations of cardiac structures. As a novel adaptation of CLIP models for few-shot echocardiogram video analysis, our approach significantly improves diagnostic accuracy, reducing mean absolute error (MAE) by 2.07 on the EchoNet-Dynamic dataset under the 1-shot setting. The code is available at this https URL.
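
To make the idea of attention-based frame aggregation concrete, the following is a minimal NumPy sketch of softmax-weighted pooling over per-frame embeddings. It is not the paper's MFL module: the function name `attention_pool`, the scoring vector `w`, and the feature sizes are all hypothetical and chosen only for illustration.

```python
import numpy as np

def attention_pool(frame_feats, w):
    """Fuse per-frame features of shape (T, D) into one video feature (D,).

    `w` is a hypothetical scoring vector of shape (D,); in a trained model
    it would be learned, here it is simply passed in.
    """
    scores = frame_feats @ w                  # (T,) one relevance score per frame
    scores = scores - scores.max()            # shift for a numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # (T,) non-negative, sums to 1
    video_feat = weights @ frame_feats        # (D,) attention-weighted sum of frames
    return video_feat, weights

# Illustrative sizes only: 16 frames with 512-dimensional embeddings.
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(16, 512))
w = rng.normal(size=512)
video_feat, weights = attention_pool(frame_feats, w)
```

With an all-zero scoring vector every frame receives the same weight, so the result reduces to plain mean pooling; non-uniform scores let the more informative frames dominate the fused feature.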
