Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching

18 April 2025

Abstract

Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy, which extends FS-RVOS to multi-object segmentation (FS-RVMOS). Experiments show FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.

@article{liu2025_2504.13710,
  title={Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching},
  author={Heng Liu and Guanghui Li and Mingqi Gao and Xiantong Zhen and Feng Zheng and Yang Wang},
  journal={arXiv preprint arXiv:2504.13710},
  year={2025}
}
