2405.21075
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
31 May 2024
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
Renrui Zhang
Zihan Wang
Chenyu Zhou
Chunjiang Ge
Mengdan Zhang
Peixian Chen
Yanwei Li
Shaohui Lin
Zhengye Zhang
Ke Li
Tong Xu
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
VLM
MLLM
Papers citing
"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"
50 / 550 papers shown
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Wenshu Fan
Kevin Qinghong Lin
C. Chen
Mike Zheng Shou
LM&Ro
LRM
942
37
0
17 Mar 2025
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
Computer Vision and Pattern Recognition (CVPR), 2025
Chiara Plizzari
A. Tonioni
Yongqin Xian
Achin Kulshrestha
F. Tombari
EgoV
329
14
0
17 Mar 2025
ViSpeak: Visual Instruction Feedback in Streaming Videos
Shenghao Fu
Q. Yang
Yuan-Ming Li
Yi-Xing Peng
Kun-Yu Lin
Xihan Wei
Jian-Fang Hu
Xiaohua Xie
Wei-Shi Zheng
VLM
302
11
0
17 Mar 2025
Efficient Motion-Aware Video MLLM
Computer Vision and Pattern Recognition (CVPR), 2025
Zijia Zhao
Yuqi Huo
Tongtian Yue
Longteng Guo
Haoyu Lu
Binghai Wang
Xin Wu
Qingbin Liu
265
4
0
17 Mar 2025
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo
Ziyang Chen
Shaoguang Wang
Jianxiang He
Yijie Xu
Jinhui Ye
Ying Sun
Hui Xiong
359
18
0
17 Mar 2025
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
Sung-Yeon Park
Can Cui
Yunsheng Ma
Ahmadreza Moradipari
Rohit Gupta
Kyungtae Han
Ziran Wang
257
12
0
17 Mar 2025
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu
Longxiang Tang
Bohao Peng
Senqiao Yang
Bei Yu
Jiaya Jia
VLM
981
11
0
16 Mar 2025
AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xiao Wang
Qingyi Si
Yue Yu
Shiyu Zhu
Zheng Lin
Liqiang Nie
VLM
421
31
0
16 Mar 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren
Wentao Ma
Huan Yang
Cong Wei
Ge Zhang
Lei Ma
Mamba
318
20
0
14 Mar 2025
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
Zixu Cheng
Jian Hu
Ziquan Liu
Chenyang Si
Wei Li
Shaogang Gong
LRM
337
26
0
14 Mar 2025
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Leqi Shen
Guoqiang Gong
Tao He
Yifeng Zhang
Pengzhang Liu
Sicheng Zhao
Guiguang Ding
VLM
410
16
0
14 Mar 2025
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Xiping Hu
Yang Liu
Ziwei Sun
Longji Xu
VLM
1.3K
21
0
13 Mar 2025
TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs
Yunxiao Wang
Meng Liu
Rui Shao
Haoyu Zhang
Bin Wen
Fan Yang
Yan Li
Di Zhang
Liqiang Nie
261
5
0
13 Mar 2025
TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention
Jinhao Duan
Fei Kong
Hao-Ran Cheng
James Diffenderfer
B. Kailkhura
Lichao Sun
Xiaofeng Zhu
Xiaoshuang Shi
Kaidi Xu
999
7
0
13 Mar 2025
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu
Jingwei Sun
Yueqian Lin
Jingyang Zhang
Ming Yin
Qinsi Wang
Jing Zhang
Haoyang Li
Yiran Chen
VLM
516
6
0
13 Mar 2025
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Qiji Zhou
Yifan Gong
Guangsheng Bao
Hongjie Qiu
Jinqiang Li
Xiangrong Zhu
Huajian Zhang
Yue Zhang
LRM
268
3
0
12 Mar 2025
CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
Peng Chen
Pi Bu
Yuhang Han
Xinyi Wang
Xiangqi Jin
...
Qi Zhu
Jun Song
Siran Yang
Jiamang Wang
Bo Zheng
344
8
0
12 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Computer Vision and Pattern Recognition (CVPR), 2025
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
1.0K
11
0
12 Mar 2025
Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment
Xiaowei Bi
Zheyuan Xu
359
3
0
12 Mar 2025
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
Ruanjun Li
Yuedong Tan
Yuanming Shi
Jiawei Shao
VLM
730
4
0
12 Mar 2025
Generative Frame Sampler for Long Video Understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Linli Yao
Haoning Wu
Kun Ouyang
Yujiao Shi
Caiming Xiong
Bei Chen
Xu Sun
Junnan Li
VLM
VGen
290
16
0
12 Mar 2025
Memory-enhanced Retrieval Augmentation for Long Video Understanding
Huaying Yuan
Zhengyang Liang
Minhao Qin
Hongjin Qian
Yan Shu
Zhicheng Dou
Ji-Rong Wen
Andrii Zadaianchuk
VOS
RALM
VLM
365
9
0
12 Mar 2025
EgoBlind: Towards Egocentric Visual Assistance for the Blind
Junbin Xiao
Nanxin Huang
Hao Qiu
Zhulin Tao
Xun Yang
Richang Hong
Ming Wang
Angela Yao
EgoV
VLM
503
8
0
11 Mar 2025
RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding
Xichen Tan
Yunfan Ye
Yuanjing Luo
Qian Wan
Fang Liu
Zhiping Cai
VLM
248
3
0
11 Mar 2025
ALLVB: All-in-One Long Video Understanding Benchmark
AAAI Conference on Artificial Intelligence (AAAI), 2025
Xichen Tan
Yuanjing Luo
Yunfan Ye
Fang Liu
Zhiping Cai
MLLM
VLM
391
4
0
10 Mar 2025
Video Action Differencing
International Conference on Learning Representations (ICLR), 2025
James Burgess
Xiaohan Wang
Yuhui Zhang
Anita Rau
Alejandro Lozano
Lisa Dunlap
Trevor Darrell
Serena Yeung-Levy
VGen
317
8
0
10 Mar 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Yue Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
938
9
0
08 Mar 2025
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Baining Zhao
Jianjie Fang
Zichao Dai
Liang Luo
Jirong Zha
...
Chen Gao
Yijiao Wang
Jinqiang Cui
Xinlei Chen
Yongqian Li
352
21
0
08 Mar 2025
CASP: Compression of Large Multimodal Models Based on Attention Sparsity
Computer Vision and Pattern Recognition (CVPR), 2025
Mohsen Gholami
Mohammad Akbari
Kevin Cannons
Yong Zhang
263
2
0
07 Mar 2025
Unified Reward Model for Multimodal Understanding and Generation
Yibin Wang
Yuhang Zang
Hao Li
Cheng Jin
Jiadong Wang
EGVM
397
81
0
07 Mar 2025
E²AT: Multimodal Jailbreak Defense via Dynamic Joint Optimization for Multimodal Large Language Models
Liming Lu
Shuchao Pang
Yaning Tan
Haotian Zhu
Xiyu Zeng
Aishan Liu
Yunhuai Liu
Yongbin Zhou
AAML
447
17
0
05 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Computer Vision and Pattern Recognition (CVPR), 2025
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
Xinyu Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Ziwei Liu
EgoV
278
12
0
05 Mar 2025
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Computer Vision and Pattern Recognition (CVPR), 2025
Zitang Zhou
Ke Mei
Yu Lu
Tianyi Wang
Fengyun Rao
430
7
0
03 Mar 2025
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abdelrahman Abouelenin
Atabak Ashfaq
Adam Atkinson
Hany Awadalla
Nguyen Bach
...
Ishmam Zabir
Yunan Zhang
Li Zhang
Yanzhe Zhang
Xiren Zhou
MoE
SyDa
302
294
0
03 Mar 2025
Adaptive Keyframe Sampling for Long Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Xi Tang
Jihao Qiu
Lingxi Xie
Yunjie Tian
Jianbin Jiao
Qixiang Ye
268
68
0
28 Feb 2025
Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
Che Liu
Yingji Zhang
D. Zhang
Weijie Zhang
Chenggong Gong
...
Junwei Liao
Haipang Wu
Ji Liu
André Freitas
Qifan Wang
AuLLM
600
8
0
26 Feb 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-Wei Chang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
632
12
0
26 Feb 2025
MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection
International Conference on Learning Representations (ICLR), 2024
Xi Jiang
Jian Li
Hanqiu Deng
Wenshu Fan
Bin-Bin Gao
Yifeng Zhou
Jialin Li
Chengjie Wang
Feng Zheng
422
0
0
24 Feb 2025
MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models
Hengzhi Li
Megan Tjandrasuwita
Yi R. Fung
Armando Solar-Lezama
Paul Pu Liang
487
7
0
23 Feb 2025
Magma: A Foundation Model for Multimodal AI Agents
Computer Vision and Pattern Recognition (CVPR), 2025
Jianwei Yang
Reuben Tan
Qianhui Wu
Ruijie Zheng
Baolin Peng
...
Seonghyeon Ye
Joel Jang
Yuquan Deng
Lars Liden
Jianfeng Gao
VLM
AI4TS
371
95
0
18 Feb 2025
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Weikai Lu
Hao Peng
Huiping Zhuang
Cen Chen
285
5
0
18 Feb 2025
VRoPE: Rotary Position Embedding for Video Large Language Models
Zikang Liu
Longteng Guo
Yepeng Tang
Tongtian Yue
Junxian Cai
Kai Ma
Qingbin Liu
Xi Chen
Jing Liu
386
7
0
17 Feb 2025
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Guangzhi Sun
Yudong Yang
Jimin Zhuang
Changli Tang
Yongqian Li
W. Li
Tianhao Shen
Chao Zhang
LRM
MLLM
VLM
323
14
0
17 Feb 2025
Unhackable Temporal Rewarding for Scalable Video MLLMs
En Yu
Kangheng Lin
Liang Zhao
Yana Wei
Zining Zhu
...
Jianjian Sun
Zheng Ge
Xinsong Zhang
Jingyu Wang
Wenbing Tao
286
22
0
17 Feb 2025
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
International Conference on Learning Representations (ICLR), 2025
Zhenyu Yang
Yihan Hu
Zemin Du
Dizhan Xue
Chuanrui Hu
Jiahong Wu
Fan Yang
Weiming Dong
Changsheng Xu
334
27
0
15 Feb 2025
CoS: Chain-of-Shot Prompting for Long Video Understanding
Jian Hu
Zixu Cheng
Chenyang Si
Wei Li
Shaogang Gong
303
18
0
10 Feb 2025
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models
Tzu-Tao Chang
Shivaram Venkataraman
VLM
1.3K
1
0
04 Feb 2025
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Xubin Ren
Lingrui Xu
Long Xia
Shuaiqiang Wang
D. Yin
Chao Huang
VGen
VLM
355
30
0
03 Feb 2025
Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
Ziyi Yin
Yuanpu Cao
Han Liu
Ting Wang
Jinghui Chen
Fenglong Ma
AAML
341
2
0
02 Feb 2025
Baichuan-Omni-1.5 Technical Report
Yadong Li
Qingbin Liu
Tao Zhang
Tao Zhang
Tian Jin
...
Jianhua Xu
Haoze Sun
Mingan Lin
Guosheng Dong
Xin Wu
AuLLM
330
65
0
28 Jan 2025