Segment Any Motion in Videos

28 March 2025

Abstract

Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state-of-the-art performance, excelling in challenging scenarios and fine-grained segmentation of multiple objects. Our code is available atthis https URL.

View on arXiv

@article{huang2025_2503.22268,
  title={ Segment Any Motion in Videos },
  author={ Nan Huang and Wenzhao Zheng and Chenfeng Xu and Kurt Keutzer and Shanghang Zhang and Angjoo Kanazawa and Qianqian Wang },
  journal={arXiv preprint arXiv:2503.22268},
  year={ 2025 }
}

Comments on this paper