Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart those from distant frames, explicitly modeling inter-frame relationships and storing these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing the ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at this https URL.
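The abstract does not specify the exact form of the temporal contrastive objective, so the following is only a minimal sketch of the stated idea of pulling embeddings of adjacent frames together while pushing distant frames apart. The function name, the margin-based pairwise formulation, and the `margin` and `adjacent` parameters are all assumptions for illustration, not the authors' implementation:

```python
import math

def temporal_contrastive_loss(frames, margin=1.0, adjacent=1):
    """Hypothetical margin-based temporal contrastive loss.

    frames: list of per-frame embedding vectors (tuples of floats).
    Pairs within `adjacent` frames of each other are treated as
    positives (pulled together); more distant pairs are negatives
    (pushed beyond `margin`). Returns the mean pairwise loss.
    """
    n = len(frames)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(frames[i], frames[j])  # Euclidean distance
            if j - i <= adjacent:
                total += d ** 2                      # pull adjacent frames together
            else:
                total += max(0.0, margin - d) ** 2   # push distant frames apart
            pairs += 1
    return total / pairs
```

Under this sketch, a sequence whose embeddings drift smoothly over time incurs a lower loss than the same embeddings in shuffled temporal order, which is the behavior the memory bank's inter-frame modeling relies on.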
@article{chen2025_2503.14979,
  title={One-Shot Medical Video Object Segmentation via Temporal Contrastive Memory Networks},
  author={Yaxiong Chen and Junjian Hu and Chunlei Li and Zixuan Zheng and Jingliang Hu and Yilei Shi and Shengwu Xiong and Xiao Xiang Zhu and Lichao Mou},
  journal={arXiv preprint arXiv:2503.14979},
  year={2025}
}