ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2501.06138
492
5
v1v2v3 (latest)

MS-Temba : Multi-Scale Temporal Mamba for Efficient Temporal Action Detection

10 January 2025
Arkaprava Sinha
Monish Soundar Raj
Pu Wang
Ahmed Helmy
Srijan Das
    Mamba
ArXiv (abs)PDFHTMLHuggingFace (1 upvotes)
Main:8 Pages
11 Figures
Bibliography:4 Pages
10 Tables
Appendix:3 Pages
Abstract

Temporal Action Detection (TAD) in untrimmed videos requires models that can efficiently (1) process long-duration videos, (2) capture temporal variations within action classes, and (3) handle dense, overlapping actions, all while remaining suitable for resource-constrained edge deployment. While Transformer-based methods achieve high accuracy, their quadratic complexity hinders deployment in such scenarios. Given the recent popularity of linear complexity Mamba-based models, leveraging them for TAD is a natural choice. However, naively adapting Mamba from language or vision tasks fails to provide an optimal solution and does not address the challenges of long, untrimmed videos. Therefore, we propose Multi-Scale Temporal Mamba (MS-Temba), the first Mamba-based architecture specifically designed for densely labeled TAD tasks. MS-Temba features Temporal Mamba Blocks (Temba Blocks), consisting of Temporal Convolutional Module (TCM) and Dilated SSM (D-SSM). TCM captures short-term dependencies using dilated convolutions, while D-SSM introduces a novel dilated state-space mechanism to model long-range temporal relationships effectively at each temporal scale. These multi-scale representations are aggregated by Scale-Aware State Fuser, which learns a unified representation for detecting densely overlapping actions. Experiments show that MS-Temba achieves state-of-the-art performance on long-duration videos, remains competitive on shorter segments, and reduces model complexity by 88%. Its efficiency and effectiveness make MS-Temba well-suited for real-world edge deployment.

View on arXiv
Comments on this paper