AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding

3 July 2025
Weili Xu
Enxin Song
Wenhao Chai
Xuexiang Wen
Tian Ye
Gaoang Wang
arXiv:2507.02591 (abs · PDF · HTML)
Main: 8 pages · Appendix: 8 pages · Bibliography: 7 pages · 15 figures · 12 tables
Abstract

The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequences of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merging with linear RNN models by reordering the visual tokens by size in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to that of Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient linear RNNs to democratize long video understanding by lowering its computational entry barrier. To the best of our knowledge, ours is the first work to use a linear RNN-based LLM backbone in a LLaVA-like model for open-ended video understanding.
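The abstract's efficiency claim rests on the linear RNN recurrence: the model folds the entire history into a fixed-size hidden state, so per-token compute and memory stay constant rather than growing quadratically as with attention. The sketch below (PyTorch; not the paper's code, and every function and variable name here is a hypothetical illustration) shows a plain linear state-space recurrence together with the ascending-size reordering of merged visual tokens that the abstract describes.

import torch

def linear_rnn_scan(x, A, B, C):
    # x: (seq_len, d_in) token embeddings; A: (d_state, d_state) state transition;
    # B: (d_state, d_in) input projection; C: (d_out, d_state) readout.
    h = x.new_zeros(A.shape[0])      # constant-size hidden state: O(1) memory in seq_len
    ys = []
    for x_t in x:                    # one linear pass; no KV cache, no quadratic attention
        h = A @ h + B @ x_t          # linear update folds all past tokens into h
        ys.append(C @ h)             # per-step readout
    return torch.stack(ys)

def reorder_by_merge_size(tokens, merge_sizes):
    # tokens: (n, d) merged visual tokens; merge_sizes: (n,) count of raw patches
    # folded into each merged token. Ascending order feeds the smallest merged
    # groups first, per the abstract's reordering step.
    order = torch.argsort(merge_sizes)
    return tokens[order]

if __name__ == "__main__":
    torch.manual_seed(0)
    d_in, d_state, d_out = 16, 32, 16
    A = torch.eye(d_state) * 0.9                  # stable toy transition matrix
    B = torch.randn(d_state, d_in) * 0.1
    C = torch.randn(d_out, d_state) * 0.1
    tokens = torch.randn(10, d_in)                # toy merged visual tokens
    sizes = torch.randint(1, 5, (10,))            # toy merge sizes
    out = linear_rnn_scan(reorder_by_merge_size(tokens, sizes), A, B, C)
    print(out.shape)                              # torch.Size([10, 16])

Because the hidden state never grows, a backbone of this kind can in principle consume arbitrarily long videos with flat memory, which is the lowered entry barrier the abstract argues for.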

BibTeX
@article{xu2025_2507.02591,
  title={AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding},
  author={Weili Xu and Enxin Song and Wenhao Chai and Xuexiang Wen and Tian Ye and Gaoang Wang},
  journal={arXiv preprint arXiv:2507.02591},
  year={2025}
}