Referring video object segmentation (RVOS) is a challenging task that requires a model to segment the objects in a video referred to by a language description. MeViS is a recently proposed dataset whose expressions describe the motion of the target objects, making it a more challenging benchmark than existing RVOS benchmarks. Meanwhile, a recent trend in referring expression tasks is to adopt multi-modal large language models (MLLMs) to achieve better image and text alignment. In this report, we show that a simple modification to the test-time inference of a stronger MLLM leads to stronger results on MeViS. In particular, we adopt the recent Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we achieve 3rd place in the 4th PVUW workshop.
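As a rough illustration of the test-time change described above, the sketch below samples a larger set of key-frame indices from a clip instead of a small fixed one. The sampling function and frame counts are assumptions for illustration only, not the official Sa2VA inference code.

# Minimal sketch: enlarge the set of key frames sampled from a video clip at
# inference time. The uniform-sampling strategy here is an assumption.

def select_key_frames(num_frames: int, num_keys: int) -> list[int]:
    """Uniformly sample `num_keys` frame indices from a clip with `num_frames` frames."""
    if num_keys >= num_frames:
        return list(range(num_frames))
    step = (num_frames - 1) / (num_keys - 1)
    return [round(i * step) for i in range(num_keys)]

# A default setup might use only a few key frames; the modification widens this set
# so the key frames cover more of the clip's temporal extent.
clip_length = 80
baseline_keys = select_key_frames(clip_length, num_keys=5)    # e.g. [0, 20, 40, 59, 79]
enlarged_keys = select_key_frames(clip_length, num_keys=16)   # broader temporal coverage

print(baseline_keys)
print(enlarged_keys)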
@article{yuan2025_2504.00476,
  title   = {4th PVUW MeViS 3rd Place Report: Sa2VA},
  author  = {Haobo Yuan and Tao Zhang and Xiangtai Li and Lu Qi and Zilong Huang and Shilin Xu and Jiashi Feng and Ming-Hsuan Yang},
  journal = {arXiv preprint arXiv:2504.00476},
  year    = {2025}
}