ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2412.01136
80
0

Referring Video Object Segmentation via Language-aligned Track Selection

2 December 2024
Seongchan Kim
Woojeong Jin
Sangbeom Lim
Heeji Yoon
Hyunwook Choi
Seungryong Kim
    VOS
ArXivPDFHTML
Abstract

Referring video object segmentation (RVOS) requires tracking and segmenting an object throughout a video according to a given natural language expression, demanding both complex motion understanding and the alignment of visual representations with language descriptions. Given these challenges, the recently proposed Segment Anything Model 2 (SAM2) emerges as a potential candidate due to its ability to generate coherent segmentation mask tracks across video frames, and provide an inherent spatio-temporal objectness in its object token representations. In this paper, we introduce SOLA (Selection by Object Language Alignment), a novel framework that leverages SAM2 object tokens as compact video-level object representations, which are aligned with language features through a lightweight track selection module. To effectively facilitate this alignment, we propose an IoU-based pseudo-labeling strategy, which bridges the modality gap between SAM2 representations with language features. Extensive experiments show that SOLA achieves state-of-the-art performance on the MeViS dataset and demonstrate that SOLA offers an effective solution for RVOS. Our project page is available at:this https URL.

View on arXiv
@article{kim2025_2412.01136,
  title={ Referring Video Object Segmentation via Language-aligned Track Selection },
  author={ Seongchan Kim and Woojeong Jin and Sangbeom Lim and Heeji Yoon and Hyunwook Choi and Seungryong Kim },
  journal={arXiv preprint arXiv:2412.01136},
  year={ 2025 }
}
Comments on this paper