Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

10 April 2025
Dibyadip Chatterjee
Edoardo Remelli
Yale Song
Bugra Tekin
Abhay Mittal
Bharat Bhatnagar
Necati Cihan Camgöz
Shreyas Hampali
Eric Sauser
Shugao Ma
Angela Yao
Fadime Sener
Topic: VLM
Abstract

We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache that stores two types of tokens: verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces the token count for representing one hour of long-term observations by 22x over existing methods, while effectively encoding the fine-grained details of the present. By interleaving these tokens in our multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length, enabling per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS, with a minimal 2GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
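The caching scheme the abstract describes can be made concrete with a small sketch. The Python below is illustrative only, not the authors' implementation: the class MultimodalCache and its methods are hypothetical names, and the summarization step is a stand-in for the verbalization the paper performs. The idea shown is the two-tier split, where recent frames are held as dense visual tokens in a bounded buffer, and frames evicted from that buffer are compressed into short text summaries, so the interleaved context stays compact as the video grows.

from collections import deque

class MultimodalCache:
    """Two-tier cache sketch: compact text summaries for long-term
    context, dense visual tokens for recent frames (hypothetical)."""

    def __init__(self, max_visual_frames=16):
        # Short-term tier: per-frame visual tokens, bounded in size.
        self.visual = deque(maxlen=max_visual_frames)
        # Long-term tier: compressed textual summaries.
        self.text_summaries = []

    def add_frame(self, visual_tokens):
        # When the short-term buffer is full, the oldest frame is about
        # to fall out: verbalize it into a short text summary first.
        if len(self.visual) == self.visual.maxlen:
            self.text_summaries.append(self._verbalize(self.visual[0]))
        self.visual.append(visual_tokens)  # deque drops the oldest frame

    def _verbalize(self, visual_tokens):
        # Stand-in for a captioning/verbalization model; in the paper
        # this text compression is what shrinks long-term token counts.
        return f"<summary of {len(visual_tokens)} visual tokens>"

    def context(self):
        # Interleaved context: long-term summaries, then the dense
        # visual tokens of the recent frames.
        tokens = list(self.text_summaries)
        for frame in self.visual:
            tokens.extend(frame)
        return tokens

# Example: 8 visual tokens per frame, a 4-frame short-term window.
cache = MultimodalCache(max_visual_frames=4)
for t in range(10):
    cache.add_frame([f"v{t}_{i}" for i in range(8)])
print(len(cache.context()))  # 6 summaries + 4 frames x 8 tokens = 38

Note that this toy version still appends one summary per evicted frame; the paper's design additionally compresses long-term observations into verbalized text, which is what yields the reported sub-linear scaling and 22x token reduction.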

View on arXiv: https://arxiv.org/abs/2504.13915
@article{chatterjee2025_2504.13915,
  title={Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding},
  author={Dibyadip Chatterjee and Edoardo Remelli and Yale Song and Bugra Tekin and Abhay Mittal and Bharat Bhatnagar and Necati Cihan Camgöz and Shreyas Hampali and Eric Sauser and Shugao Ma and Angela Yao and Fadime Sener},
  journal={arXiv preprint arXiv:2504.13915},
  year={2025}
}