arXiv:2405.21075 (v3, latest)
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
31 May 2024
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
Renrui Zhang
Zihan Wang
Chenyu Zhou
Chunjiang Ge
Mengdan Zhang
Peixian Chen
Yanwei Li
Shaohui Lin
Zhengye Zhang
Ke Li
Tong Xu
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
VLM
MLLM
HuggingFace (25 upvotes)
Papers citing "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"
50 / 550 papers shown
VideoAds for Fast-Paced Video Understanding
Zheyuan Zhang
Monica Dou
Linkai Peng
Hongyi Pan
Ulas Bagci
Boqing Gong
VLM
289
1
0
12 Apr 2025
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
M. Dhouib
Davide Buscaldi
Sonia Vanier
A. Shabou
VLM
318
15
0
11 Apr 2025
Kimi-VL Technical Report
Kimi Team
Angang Du
B. Yin
Bowei Xing
Bowen Qu
...
Longxiang Zhang
Zhe Chen
Zijia Zhao
Ziwei Chen
Zongyu Lin
MLLM
VLM
MoE
976
143
0
10 Apr 2025
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
Yukun Qi
Yiming Zhao
Y. Zeng
Xikun Bao
Wenjie Huang
Lin Yen-Chen
Zehui Chen
Jie Zhao
Zhongang Qi
Feng Zhao
LRM
316
18
0
10 Apr 2025
Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models
Xingguang Ji
Jiakang Wang
Hongzhi Zhang
Jingyuan Zhang
Haonan Zhou
Chenxi Sun
Wenshu Fan
Qi Wang
Fuzheng Zhang
MLLM
VLM
302
1
0
10 Apr 2025
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Yangliu Hu
Zikai Song
Na Feng
Yawei Luo
Junqing Yu
Yi-Ping Phoebe Chen
Wei Yang
197
11
0
10 Apr 2025
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
Ziyi Wang
Haoran Wu
Yiming Rong
Deyang Jiang
Yixin Zhang
Yue Zhao
Shuang Xu
Bo Xu
VLM
219
3
0
09 Apr 2025
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Xinhao Li
Ziang Yan
Desen Meng
Yi Liu
Xiangyu Zeng
Yinan He
Yun Wang
Yu Qiao
Yi Wang
Limin Wang
VLM
AI4TS
LRM
808
120
0
09 Apr 2025
PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
Xinpeng Ding
Jianchao Tan
Jianhua Han
Lanqing Hong
Hang Xu
Xuelong Li
MLLM
VLM
1.1K
6
0
08 Apr 2025
SmolVLM: Redefining small and efficient multimodal models
Andres Marafioti
Orr Zohar
Miquel Farré
Merve Noyan
Elie Bakouch
...
Hugo Larcher
Mathieu Morlon
Lewis Tunstall
Leandro von Werra
Thomas Wolf
VLM
463
117
0
07 Apr 2025
Advancing Egocentric Video Question Answering with Multimodal Large Language Models
Alkesh Patel
Vibhav Chitalia
Yinfei Yang
184
5
0
06 Apr 2025
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
Wulin Xie
Yujiao Shi
Chaoyou Fu
Yang Shi
Bingyan Nie
Hongkai Chen
Zheng Zhang
Liang Wang
Tieniu Tan
419
8
0
04 Apr 2025
SocialGesture: Delving into Multi-person Gesture Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Xu Cao
Pranav Virupaksha
Wenqi Jia
Bolin Lai
Fiona Ryan
Sangmin Lee
James M. Rehg
SLR
230
5
0
03 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian Guan
Wei Wu
Rui Yan
VLM
677
17
0
03 Apr 2025
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Junwen Pan
Rui Zhang
Xin Wan
Yuan Zhang
Ming Lu
Qi She
VLM
305
4
0
02 Apr 2025
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Min Shi
Shihao Wang
Chieh-Yun Chen
Jitesh Jain
Kai Wang
Junjun Xiong
Guilin Liu
Zhiding Yu
Humphrey Shi
228
7
0
02 Apr 2025
GazeLLM: Multimodal LLMs incorporating Human Visual Attention
Augmented Humans International Conference (AHs), 2025
Jun Rekimoto
213
2
0
31 Mar 2025
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
Yi Chen
Yuying Ge
Rui Wang
Yixiao Ge
Lu Qiu
Mingyu Ding
Xihui Liu
ReLM
VLM
OffRL
LRM
297
14
0
31 Mar 2025
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
Yongbin Li
Yujiao Shi
Tao Lin
Xiangrui Liu
Wenxiao Cai
Zhengyang Liang
Bo Zhao
LRM
598
35
0
31 Mar 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Computer Vision and Pattern Recognition (CVPR), 2025
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
270
4
0
31 Mar 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
435
6
0
29 Mar 2025
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
Computer Vision and Pattern Recognition (CVPR), 2025
Yanjie Wang
Longji Xu
Bo Chen
Tong Wu
Dongyan Zhao
Zilong Zheng
VLM
MLLM
331
9
0
29 Mar 2025
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
Computer Vision and Pattern Recognition (CVPR), 2025
Xiaoqin Wang
Xusen Ma
Xianxu Hou
Meidan Ding
Yudong Li
Junliang Chen
Wenting Chen
Xiaoyang Peng
LinLin Shen
CVBM
330
7
0
27 Mar 2025
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng
Kaixiong Gong
Yangqiu Song
Zonghao Guo
Yibing Wang
Tianshuo Peng
Jian Wu
Xiaoying Zhang
Benyou Wang
Xiangyu Yue
AI4TS
SyDa
LRM
581
230
0
27 Mar 2025
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Shuming Liu
Chen Zhao
Tianqi Xu
Bernard Ghanem
VLM
292
25
0
27 Mar 2025
MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX
Liuyue Xie
George Z. Wei
Avik Kuthiala
Ce Zheng
Ananya Bal
...
Rohan Choudhury
Morteza Ziyadi
Xu Zhang
Hao Yang
László A. Jeni
317
1
0
27 Mar 2025
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
Yue Li
Meng Tian
Zhenyu Lin
Jiangtong Zhu
Dechang Zhu
Haiqiang Liu
Zining Wang
Yueyi Zhang
Zhiwei Xiong
Xinhai Zhao
CoGe
VLM
366
11
0
27 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
316
5
0
26 Mar 2025
Qwen2.5-Omni Technical Report
Jin Xu
Zhifang Guo
Jinzheng He
Hangrui Hu
Ting He
...
K. Dang
Bin Zhang
Xinyu Wang
Yunfei Chu
Junyang Lin
VGen
AuLLM
1.2K
344
0
26 Mar 2025
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Carlos Plou
Cesar Borja
Ruben Martinez-Cantin
Ana C. Murillo
339
0
0
25 Mar 2025
ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
Dohwan Ko
S. Kim
Yumin Suh
Vijay Kumar B.G
Minseo Yoon
Manmohan Chandraker
Hyunwoo J. Kim
LRM
316
6
0
25 Mar 2025
Audio-centric Video Understanding Benchmark without Text Shortcut
Yue Yang
Jimin Zhuang
Guangzhi Sun
Changli Tang
Yongqian Li
P. Li
Yifan Jiang
W. Li
Tianhao Shen
Chao Zhang
AuLLM
CoGe
423
0
0
25 Mar 2025
PAVE: Patching and Adapting Video Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Zhuoming Liu
Yiquan Li
Khoi Duc Nguyen
Yiwu Zhong
Yin Li
KELM
LRM
362
1
0
25 Mar 2025
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu
Mingfei Gao
Shiyu Li
Jiasen Lu
Zhe Gan
Zhengfeng Lai
Meng Cao
Kai Kang
Yue Yang
Afshin Dehghan
425
15
0
24 Mar 2025
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Handong Li
Yiyuan Zhang
Longteng Guo
Xiangyu Yue
Jing Liu
VLM
322
3
0
24 Mar 2025
LLaVAction: evaluating and training multi-modal large language models for action understanding
Shaokai Ye
Haozhe Qi
Alexander Mathis
Mackenzie W. Mathis
361
0
0
24 Mar 2025
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
Meng Cao
Pengfei Hu
Yuhang Han
J. Gu
Haoran Tang
...
Jun Song
Xiang Li
Bo Zheng
Ian Reid
Xiaodan Liang
200
8
0
24 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zhengyang Liang
Ao Li
Yang Tian
Bo Zhao
VGen
VLM
539
30
0
24 Mar 2025
GOAL: Global-local Object Alignment Learning
Computer Vision and Pattern Recognition (CVPR), 2025
Hyungyu Choi
Young Kyun Jang
Chanho Eom
VLM
918
6
0
22 Mar 2025
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Yiming Zhao
Y. Zeng
Yukun Qi
Yi Liu
Lin Yen-Chen
Zehui Chen
Xikun Bao
Jie Zhao
Feng Zhao
VLM
343
4
0
22 Mar 2025
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
Bing Li
Cheng Zheng
Jinjie Mai
Jun-Cheng Chen
...
Abdullah Hamdi
Sara Rojas Martinez
Chia-Wen Lin
Mohamed Elhoseiny
Bernard Ghanem
VLM
275
1
0
22 Mar 2025
Judge Anything: MLLM as a Judge Across Any Modality
Shu Pu
Yaochen Wang
Benlin Liu
Yuhang Chen
Guohao Wang
...
Zetong Zhou
Shuang Gong
Yi Gui
Yao Wan
Philip S. Yu
ELM
VLM
244
15
0
21 Mar 2025
What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?
Xuanming Cui
Jaiminkumar Ashokbhai Bhoi
Chionh Wei Peng
Adriel Kuek
Ser-Nam Lim
278
0
0
20 Mar 2025
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Zhihang Liu
Chen-Wei Xie
Nianzu Yang
Liming Zhao
Longxiang Tang
Yun Zheng
Chuanbin Liu
Hongtao Xie
VLM
236
14
0
20 Mar 2025
Agentic Keyframe Search for Video Question Answering
Sunqi Fan
Meng-Hao Guo
Shuojin Yang
216
3
0
20 Mar 2025
XAttention: Block Sparse Attention with Antidiagonal Scoring
Ruyi Xu
Guangxuan Xiao
Haofeng Huang
Junxian Guo
Enze Xie
336
55
0
20 Mar 2025
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding
Chongjun Tu
Lin Zhang
Pengtao Chen
Peng Ye
Xianfang Zeng
Wei Cheng
Gang Yu
Tao Chen
356
8
0
19 Mar 2025
Improving LLM Video Understanding with 16 Frames Per Second
Yongqian Li
Changli Tang
Jimin Zhuang
Yudong Yang
Guangzhi Sun
W. Li
Tianhao Shen
Chao Zhang
VLM
420
11
0
18 Mar 2025
Impossible Videos
Zechen Bai
Hai Ci
Mike Zheng Shou
EGVM
VGen
317
7
0
18 Mar 2025
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Yiqi Zhu
Zihan Wang
Chen Zhang
Ziwei Sun
Yang Liu
CoGe
VLM
260
3
0
18 Mar 2025