Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales
Search
Open menu
Home
Papers
2403.11481
Cited By
v1
v2 (latest)
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
18 March 2024
Yue Fan
Xiaojian Ma
Rujie Wu
Yuntao Du
Jiaqi Li
Zhi Gao
Qing Li
VLM
LLMAG
Re-assign community
ArXiv (abs)
PDF
HTML
HuggingFace (13 upvotes)
Papers citing
"VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding"
48 / 98 papers shown
Title
A Grounded Memory System For Smart Personal Assistants
Felix Ocker
J. Deigmöller
Pavel Smirnov
Julian Eggert
212
2
0
09 May 2025
FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Rong Gao
Xin Liu
Zhuozhao Hu
Bohao Xing
Baiqiang Xia
Zitong Yu
Heikki Kälviäinen
241
2
0
28 Apr 2025
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering
Noriyuki Kugo
Xiang Li
Zhiyu Li
Ashish Gupta
Arpandeep Khatua
...
Yuta Kyuragi
Yasunori Ishii
Masamoto Tanabiki
Kazuki Kozuka
Ehsan Adeli
360
9
0
25 Apr 2025
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
Ling You
Hao Wu
Xinni Xie
Xiangyi Wei
Bangyan Li
Shaohui Lin
Yang Li
Changbo Wang
VGen
947
3
0
24 Apr 2025
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Ziqi Pang
Yu-Xiong Wang
VLM
193
5
0
22 Apr 2025
REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding
Sakib Reza
Xiyun Song
Heather Yu
Zongfang Lin
Mohsen Moghaddam
Mario Sznaier
178
0
0
07 Apr 2025
Building LLM Agents by Incorporating Insights from Computer Systems
Yapeng Mi
Zhi Gao
Xiaojian Ma
Qing Li
LLMAG
284
1
0
06 Apr 2025
VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT
Zhuo Zhi
Qiangqiang Wu
Minghe shen
Wenbo Li
Yinchuan Li
Youssef Attia El Hili
Kaiwen Zhou
LLMAG
395
13
0
06 Apr 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Computer Vision and Pattern Recognition (CVPR), 2025
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
210
1
0
31 Mar 2025
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Carlos Plou
Cesar Borja
Ruben Martinez-Cantin
Ana C. Murillo
206
0
0
25 Mar 2025
Agentic Keyframe Search for Video Question Answering
Sunqi Fan
Meng-Hao Guo
Shuojin Yang
169
3
0
20 Mar 2025
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Wenshu Fan
Kevin Qinghong Lin
C. Chen
Mike Zheng Shou
LM&Ro
LRM
821
31
0
17 Mar 2025
VITED: Video Temporal Evidence Distillation
Computer Vision and Pattern Recognition (CVPR), 2025
Yujie Lu
Yale Song
William Yang Wang
Lorenzo Torresani
Tushar Nagarajan
926
2
0
17 Mar 2025
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo
Ziyang Chen
Shaoguang Wang
Jianxiang He
Yijie Xu
Jinhui Ye
Ying Sun
Hui Xiong
279
15
0
17 Mar 2025
A Survey on the Optimization of Large Language Model-based Agents
Shangheng Du
Jiabao Zhao
Jinxin Shi
Zhentao Xie
Xin Jiang
Yanhong Bai
Xiaoling Wang
LLMAG
LM&Ro
LM&MA
973
14
0
16 Mar 2025
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Xiping Hu
Yang Liu
Ziwei Sun
Longji Xu
VLM
1.1K
14
0
13 Mar 2025
Long-Video Audio Synthesis with Multi-Agent Collaboration
Yehang Zhang
Xinli Xu
Xiaojie Xu
L. Liu
Yuxiao Chen
DiffM
VGen
228
2
0
13 Mar 2025
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Computer Vision and Pattern Recognition (CVPR), 2025
Kevin Qinghong Lin
Mike Zheng Shou
VGen
918
3
0
12 Mar 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Yue Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
849
9
0
08 Mar 2025
Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs
Gengyuan Zhang
Mingcong Ding
Tong Liu
Yao Zhang
Volker Tresp
369
2
0
24 Feb 2025
OmniQuery: Contextually Augmenting Captured Multimodal Memory to Enable Personal Question Answering
International Conference on Human Factors in Computing Systems (CHI), 2024
Jiahao Nick Li
Zhuohao Jerry Zhang
Zhang
375
5
0
24 Feb 2025
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Xubin Ren
Lingrui Xu
Long Xia
Shuaiqiang Wang
D. Yin
Chao Huang
VGen
VLM
287
24
0
03 Feb 2025
ENTER: Event Based Interpretable Reasoning for VideoQA
Hammad A. Ayyubi
Junzhang Liu
Ali Asgarov
Zaber Ibn Abdul Hakim
Najibul Haque Sarker
...
Md. Atabuzzaman
Xudong Lin
Naveen Reddy Dyava
Shih-Fu Chang
Chris Thomas
NAI
470
3
0
24 Jan 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Yi Wang
Xinhao Li
Ziang Yan
Yinan He
Jiashuo Yu
...
Kai Chen
Wenhai Wang
Yu Qiao
Yali Wang
Limin Wang
432
101
0
21 Jan 2025
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
Computer Vision and Pattern Recognition (CVPR), 2025
Zeyi Huang
Zeyi Huang
Xiaofang Wang
Nikhil Mehta
Tong Xiao
...
Bolin Lai
Licheng Yu
Ning Zhang
Yong Jae Lee
Miao Liu
118
6
0
08 Jan 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Computer Vision and Pattern Recognition (CVPR), 2025
Rui Qian
Shuangrui Ding
Xiaoyi Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Dahua Lin
Jiaqi Wang
196
29
0
06 Jan 2025
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Pan Zhang
Xiaoyi Dong
Yuhang Cao
Yuhang Zang
Rui Qian
...
Xinsong Zhang
Kai Chen
Yu Qiao
Dahua Lin
Jiaqi Wang
KELM
334
31
0
12 Dec 2024
Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey
Yunkai Dang
Kaichen Huang
Jiahao Huo
Yibo Yan
Shijie Huang
...
Kun Wang
Yong Liu
Jing Shao
Hui Xiong
Xuming Hu
LRM
357
46
0
03 Dec 2024
SEAL: Semantic Attention Learning for Long Video Representation
Computer Vision and Pattern Recognition (CVPR), 2024
Lan Wang
Yujia Chen
Wen-Sheng Chu
Vishnu Boddeti
Du Tran
VLM
426
7
0
02 Dec 2024
Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory
Zaira Manigrasso
Matteo Dunnhofer
Antonino Furnari
Moritz Nottebaum
Antonio Finocchiaro
Davide Marana
Rosario Forte
G. Farinella
C. Micheloni
295
3
0
25 Nov 2024
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
Mohammadali Shakerdargah
Shan Lu
Chao Gao
Di Niu
368
1
0
20 Nov 2024
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Yongdong Luo
Xiawu Zheng
Guilin Li
Guilin Li
Haojia Lin
...
Jinfa Huang
Jiayi Ji
Jiebo Luo
Rongrong Ji
Rongrong Ji
VLM
538
67
0
20 Nov 2024
HourVideo: 1-Hour Video-Language Understanding
Neural Information Processing Systems (NeurIPS), 2024
Keshigeyan Chandrasegaran
Agrim Gupta
Lea M. Hadzic
Taran Kota
Jimming He
Cristobal Eyzaguirre
Zane Durante
Pengfei Yu
Jiajun Wu
L. Fei-Fei
VLM
213
82
0
07 Nov 2024
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Neural Information Processing Systems (NeurIPS), 2024
Shentong Mo
Yibing Song
201
2
0
30 Oct 2024
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
International Conference on Learning Representations (ICLR), 2024
Xiangyu Zeng
Kunchang Li
Chenting Wang
Xinhao Li
Tianxiang Jiang
...
Zhengrong Yue
Yi Wang
Yali Wang
Yu Qiao
Limin Wang
MLLM
VLM
AI4TS
227
52
0
25 Oct 2024
VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding
Conference on Robot Learning (CoRL), 2024
Runsen Xu
Zhiwei Huang
Tai Wang
Yuxiao Chen
Jiangmiao Pang
Dahua Lin
VGen
198
34
0
17 Oct 2024
Episodic Memory Verbalization using Hierarchical Representations of Life-Long Robot Experience
Leonard Barmann
Chad DeChant
Joana Plewnia
Fabian Peller-Konrad
Daniel Bauer
Tamim Asfour
Alex Waibel
LM&Ro
341
4
0
26 Sep 2024
AMEGO: Active Memory from long EGOcentric videos
European Conference on Computer Vision (ECCV), 2024
Gabriele Goletto
Tushar Nagarajan
Giuseppe Averta
Dima Damen
EgoV
176
18
0
17 Sep 2024
Question-Answering Dense Video Events
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024
Hangyu Qin
Junbin Xiao
Angela Yao
VLM
421
6
0
06 Sep 2024
VideoQA in the Era of LLMs: An Empirical Study
International Journal of Computer Vision (IJCV), 2024
Junbin Xiao
Nanxin Huang
Hangyu Qin
Dongyang Li
Yicong Li
...
Zhulin Tao
Jianxing Yu
Liang Lin
Tat-Seng Chua
Angela Yao
279
23
0
08 Aug 2024
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Pengxiang Li
Zhi Gao
Bofei Zhang
Tao Yuan
Yuwei Wu
Mehrtash Harandi
Yunde Jia
Song-Chun Zhu
Qing Li
VLM
MLLM
242
9
0
16 Jul 2024
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Haozhe Zhao
Xiaojian Ma
Liang Chen
Shuzheng Si
Rujie Wu
Kaikai An
Peiyu Yu
Minjia Zhang
Qing Li
Baobao Chang
284
147
0
07 Jul 2024
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
Lu Zhang
Tiancheng Zhao
Heting Ying
Yibo Ma
Kyusong Lee
LLMAG
227
24
0
24 Jun 2024
DrVideo: Document Retrieval Based Long Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2024
Ziyu Ma
Chenhui Gou
Hengcan Shi
Bin Sun
Shutao Li
Hamid Rezatofighi
Jianfei Cai
VLM
182
35
0
18 Jun 2024
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
Jongwoo Park
Kanchana Ranasinghe
Kumara Kahatapitiya
Wonjeong Ryoo
Donghyun Kim
Michael S. Ryoo
282
53
0
13 Jun 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
384
137
0
29 May 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
571
152
0
29 Dec 2023
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
Zhi Gao
Yuntao Du
Xintong Zhang
Xiaojian Ma
Wenjuan Han
Song-Chun Zhu
Qing Li
LLMAG
VLM
303
43
0
18 Dec 2023
Previous
1
2