MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval

24 February 2025
Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang
Abstract

Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are built on large-scale pre-trained vision-language models (e.g., CLIP). However, because CLIP has an inherently plain, single-scale structure, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid to the last single-scale feature map. We then employ the Mamba structure as an efficient multi-scale learner that jointly learns scale-wise representations. Furthermore, we conduct comprehensive studies investigating different model structures and designs. Extensive results on three popular benchmarks validate the superiority of MUSE.
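To make the pipeline above concrete, below is a minimal sketch of the two steps the abstract describes: building a feature pyramid from the final single-scale feature map, then jointly modeling all scale tokens with a linear-complexity Mamba layer. The names (build_pyramid, MultiScaleMamba), the pooling scales, the tensor shapes, and the use of the mamba-ssm package are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch, not the authors' code: pool the final CLIP grid
# feature into a pyramid, then run one Mamba layer over all scales jointly.
# Requires: pip install torch mamba-ssm (the fused kernels need a CUDA GPU).
import torch
import torch.nn.functional as F
from mamba_ssm import Mamba  # official Mamba block: (B, L, D) -> (B, L, D)

def build_pyramid(feat_map, scales=(1, 2, 4)):
    """Average-pool the last single-scale feature map (B, D, H, W)
    at several strides to obtain multi-scale representations."""
    return [feat_map if s == 1 else F.avg_pool2d(feat_map, kernel_size=s)
            for s in scales]

class MultiScaleMamba(torch.nn.Module):
    """Flatten each pyramid level into tokens, concatenate fine-to-coarse,
    and let a single Mamba layer model all scales in linear time."""
    def __init__(self, dim):
        super().__init__()
        self.mamba = Mamba(d_model=dim)  # d_state/d_conv/expand at defaults

    def forward(self, feat_map):
        levels = build_pyramid(feat_map)
        # (B, D, H_i, W_i) -> (B, H_i*W_i, D), then concat along tokens
        tokens = torch.cat(
            [lvl.flatten(2).transpose(1, 2) for lvl in levels], dim=1)
        return self.mamba(tokens)  # (B, sum_i H_i*W_i, D)

device = "cuda"  # mamba-ssm's selective-scan kernels run on GPU
x = torch.randn(2, 512, 8, 8, device=device)  # stand-in CLIP grid feature
out = MultiScaleMamba(512).to(device)(x)
print(out.shape)  # torch.Size([2, 84, 512]): 64 + 16 + 4 scale tokens

Pooling from the final feature map, rather than tapping intermediate backbone layers, matches the abstract's description and keeps the pyramid cheap; the single scan over the concatenated scale tokens is what provides cross-resolution modeling at linear cost.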

@article{tang2025_2408.10575,
  title={MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval},
  author={Haoran Tang and Meng Cao and Jinfa Huang and Ruyang Liu and Peng Jin and Ge Li and Xiaodan Liang},
  journal={arXiv preprint arXiv:2408.10575},
  year={2025}
}