ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2405.21075
  4. Cited By
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
v1v2v3 (latest)

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

31 May 2024
Chaoyou Fu
Yuhan Dai
Yondong Luo
Lei Li
Shuhuai Ren
Renrui Zhang
Zihan Wang
Chenyu Zhou
Chunjiang Ge
Mengdan Zhang
Peixian Chen
Yanwei Li
Shaohui Lin
Zhengye Zhang
Ke Li
Tong Xu
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
Xing Sun
    VLMMLLM
ArXiv (abs)PDFHTMLHuggingFace (25 upvotes)

Papers citing "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"

50 / 550 papers shown
Inference Compute-Optimal Video Vision Language Models
Inference Compute-Optimal Video Vision Language ModelsAnnual Meeting of the Association for Computational Linguistics (ACL), 2025
Peiqi Wang
ShengYun Peng
Xuewen Zhang
Hanchao Yu
Yibo Yang
Lifu Huang
Fujun Liu
Qifan Wang
VLM
276
2
0
24 May 2025
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
Ziwei Zhou
Rui Wang
Zuxuan Wu
AuLLMVGen
200
22
0
23 May 2025
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Xiaoyi Zhang
Zhaoyang Jia
Zongyu Guo
Jiahao Li
Bin Li
Houqiang Li
Yan Lu
641
14
0
23 May 2025
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Zebin You
Shen Nie
Xiaolu Zhang
Jun Hu
Jun Zhou
Zhiwu Lu
J. Wen
Chongxuan Li
MLLMVLM
433
61
0
22 May 2025
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Shilin Yan
Jiaming Han
Joey Tsai
Hongwei Xue
Rongyao Fang
Lingyi Hong
Ziyu Guo
Ray Zhang
VLM
298
7
0
22 May 2025
From Evaluation to Defense: Advancing Safety in Video Large Language Models
From Evaluation to Defense: Advancing Safety in Video Large Language Models
Yiwei Sun
Peiqi Jiang
Chuanbin Liu
Luohao Lin
Zhiying Lu
Hongtao Xie
198
1
0
22 May 2025
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
Benjamin Schneider
Dongfu Jiang
Chao Du
Tianyu Pang
Wenhu Chen
VLM
235
4
0
22 May 2025
Clapper: Compact Learning and Video Representation in VLMs
Clapper: Compact Learning and Video Representation in VLMs
Lingyu Kong
Hongzhi Zhang
Jingyuan Zhang
Jianzhao Huang
Kunze Li
Qi Wang
Fuzheng Zhang
VLM
220
0
0
21 May 2025
Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency
Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal InconsistencyAnnual Meeting of the Association for Computational Linguistics (ACL), 2025
Jiafeng Liang
Shixin Jiang
Xuan Dong
Ning Wang
Zheng Chu
Hui Su
Jinlan Fu
Ming-Yuan Liu
See-Kiong Ng
Bing Qin
247
1
0
20 May 2025
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Wentao Ma
Weiming Ren
Yiming Jia
Zhuofeng Li
Ping Nie
Ge Zhang
Wenhu Chen
262
6
0
20 May 2025
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Xuyang Liu
Yiyu Wang
Junpeng Ma
Linfeng Zhang
VLM
172
11
0
20 May 2025
SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models
SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models
Bo Liu
Pengfei Qiao
Minhan Ma
Xuange Zhang
Yinan Tang
Peng Xu
Kun Liu
Tongtong Yuan
258
4
0
19 May 2025
Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models
Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models
Keunwoo Peter Yu
Joyce Chai
MLLMVLM
289
0
0
16 May 2025
Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Galann Pennec
Zhengyuan Liu
Nicholas Asher
Philippe Muller
Nancy F. Chen
VGen
464
1
0
10 May 2025
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Haibo Wang
Bo Feng
Zhengfeng Lai
Mingze Xu
Shiyu Li
Weifeng Ge
Afshin Dehghan
Meng Cao
Ping Huang
OffRL
616
6
0
08 May 2025
SITE: towards Spatial Intelligence Thorough Evaluation
SITE: towards Spatial Intelligence Thorough Evaluation
Wenjie Wang
Reuben Tan
Pengyue Zhu
Jianwei Yang
Zhengyuan Yang
Lijuan Wang
Andrey Kolobov
Jianfeng Gao
Boqing Gong
290
6
0
08 May 2025
R^3-VQA: "Read the Room" by Video Social Reasoning
R^3-VQA: "Read the Room" by Video Social Reasoning
Lixing Niu
Jiapeng Li
Xingping Yu
Shu Wang
Ruining Feng
Bo Wu
Ping Wei
Longji Xu
Lifeng Fan
284
3
0
07 May 2025
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
Shuhang Xun
Sicheng Tao
Jiajun Li
Yibo Shi
Zhixin Lin
...
Shikang Wang
Wenshu Fan
Hao Zhang
Ying Ma
Xuming Hu
VLMLRM
379
5
0
04 May 2025
Grounding Task Assistance with Multimodal Cues from a Single Demonstration
Grounding Task Assistance with Multimodal Cues from a Single DemonstrationAnnual Meeting of the Association for Computational Linguistics (ACL), 2025
Gabriel Sarch
Balasaravanan Thoravi Kumaravel
Sahithya Ravi
Vibhav Vineet
A. D. Wilson
944
0
0
02 May 2025
AVA: Towards Agentic Video Analytics with Vision Language Models
AVA: Towards Agentic Video Analytics with Vision Language Models
Yuxuan Yan
Shiqi Jiang
Ting Cao
Yifan Yang
Qianqian Yang
Yuanchao Shu
Yue Yang
Lili Qiu
VLM
533
0
0
01 May 2025
MINERVA: Evaluating Complex Video Reasoning
MINERVA: Evaluating Complex Video Reasoning
Arsha Nagrani
Sachit Menon
Ahmet Iscen
Shyamal Buch
Ramin Mehran
...
Yukun Zhu
Carl Vondrick
Mikhail Sirotenko
Cordelia Schmid
Tobias Weyand
339
8
0
01 May 2025
Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering
Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering
Yumeng Shi
Quanyu Long
Wenya Wang
303
1
0
30 Apr 2025
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
SeriesBench: A Benchmark for Narrative-Driven Drama Series UnderstandingComputer Vision and Pattern Recognition (CVPR), 2025
Yiming Lei
Chenkai Zhang
Ziqiang Liu
Haitao Leng
Shaoguo Liu
Tingting Gao
Qingjie Liu
Yunhong Wang
AI4TS
521
0
0
30 Apr 2025
UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities
UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities
Woongyeong Yeo
Kangsan Kim
Soyeong Jeong
Jinheon Baek
Sung Ju Hwang
400
1
0
29 Apr 2025
Learning Streaming Video Representation via Multitask Training
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
503
3
0
28 Apr 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Qi Zhang
Tat-Seng Chua
Tianwei Zhang
ALMELM
554
23
0
26 Apr 2025
ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
Yi-Xing Peng
Q. Yang
Yu-Ming Tang
Shenghao Fu
Kun-Yu Lin
Xihan Wei
Wei-Shi Zheng
254
5
0
25 Apr 2025
VEU-Bench: Towards Comprehensive Understanding of Video Editing
VEU-Bench: Towards Comprehensive Understanding of Video EditingComputer Vision and Pattern Recognition (CVPR), 2025
Bozheng Li
Y. Wu
Yi Lu
Jiashuo Yu
Licheng Tang
Jiawang Cao
Wenqing Zhu
Yuyang Sun
Jay Wu
Wenbo Zhu
306
1
0
24 Apr 2025
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Linli Yao
You Li
Y. X. Wei
Lei Li
Shuhuai Ren
...
Sida Li
Dianbo Sui
Qi Liu
Yanzhe Zhang
Xu Sun
279
17
0
24 Apr 2025
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
De-An Huang
Subhashree Radhakrishnan
Zhiding Yu
Jan Kautz
VGenVLM
436
13
0
24 Apr 2025
VideoVista-CulturalLingo: 360$^\circ$ Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
VideoVista-CulturalLingo: 360∘^\circ∘ Horizons-Bridging Cultures, Languages, and Domains in Video ComprehensionAnnual Meeting of the Association for Computational Linguistics (ACL), 2025
Xinyu Chen
Yunxin Li
Haoyuan Shi
Baotian Hu
Tong Lu
Yaowei Wang
Hao Fei
ELM
291
2
0
23 Apr 2025
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Hanlei Zhang
Zhuohang Li
Yeshuang Zhu
Hua Xu
Peiwu Wang
Haige Zhu
Jie Zhou
Jinchao Zhang
415
2
0
23 Apr 2025
Sparsity Forcing: Reinforcing Token Sparsity of MLLMs
Sparsity Forcing: Reinforcing Token Sparsity of MLLMs
Feng Chen
Yefei He
Lequan Lin
Qingbin Liu
Bohan Zhuang
Qi Wu
Qi Wu
361
1
0
23 Apr 2025
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Chris
Yichen Wei
Yi Peng
Xiang Wang
Weijie Qiu
...
Jianhao Zhang
Y. Hao
Xuchen Song
Yang Liu
Yahui Zhou
OffRLAI4TSSyDaLRMVLM
533
26
0
23 Apr 2025
Vidi: Large Multimodal Models for Video Understanding and Editing
Vidi: Large Multimodal Models for Video Understanding and Editing
Vidi Team
Celong Liu
Chia-Wen Kuo
Dawei Du
Fan Chen
...
Xiaohui Shen
Xin Gu
Xing Mei
Xueqiong Qu
Zhenfang Chen
359
7
0
22 Apr 2025
MR. Video: "MapReduce" is the Principle for Long Video Understanding
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Ziqi Pang
Yu-Xiong Wang
VLM
275
7
0
22 Apr 2025
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Yucheng Li
Huiqiang Jiang
Chengruidong Zhang
Qianhui Wu
Xufang Luo
...
Amir H. Abdi
Dongsheng Li
Jianfeng Gao
Yue Yang
Lili Qiu
352
17
0
22 Apr 2025
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
Ji Qi
Yuan Yao
Yushi Bai
Bin Xu
Juanzi Li
Zhiyuan Liu
Tat-Seng Chua
271
5
0
21 Apr 2025
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding
Tong Zeng
Longfeng Wu
Liang Shi
Dawei Zhou
Feng Guo
203
7
0
20 Apr 2025
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song
Wenhao Chai
Weili Xu
Jianwen Xie
Yuxuan Liu
Gaoang Wang
398
20
0
20 Apr 2025
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
Yogesh Kulkarni
Pooyan Fazli
516
4
0
18 Apr 2025
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Haojian Huang
Haodong Chen
Shengqiong Wu
Meng Luo
Jinlan Fu
Xinya Du
Hao Zhang
Hao Fei
AI4TS
977
8
0
17 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjDVOS
666
107
0
17 Apr 2025
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Pritam Sarkar
Ali Etemad
328
2
0
16 Apr 2025
TerraMind: Large-Scale Generative Multimodality for Earth Observation
TerraMind: Large-Scale Generative Multimodality for Earth Observation
Johannes Jakubik
Felix Yang
Benedikt Blumenstiel
Erik Scheurer
Rocco Sedona
...
P. Fraccaro
Thomas Brunschwiler
Gabriele Cavallaro
Juan Bernabé-Moreno
Alessandra Feliciotti
MLLMVLM
490
39
0
15 Apr 2025
Reimagining Urban Science: Scaling Causal Inference with Large Language Models
Reimagining Urban Science: Scaling Causal Inference with Large Language Models
Yutong Xia
Ao Qu
Yunhan Zheng
Yihong Tang
Dingyi Zhuang
...
Cathy Wu
Roger Zimmermann
Lijun Sun
Roger Zimmermann
Jinhua Zhao
AI4CE
952
2
0
15 Apr 2025
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Haoran Hao
Jiaming Han
Yiyuan Zhang
Xiangyu Yue
495
0
0
14 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Ziwei Liu
Shenglong Ye
...
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
Wei Wang
MLLMVLM
596
790
1
14 Apr 2025
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Yang Shi
Jiaheng Liu
Yushuo Guan
Zhikai Wu
Yujiao Shi
...
Bohan Zeng
Wei Zhang
Fuzheng Zhang
Wenjing Yang
Di Zhang
VGenVLM
380
11
0
14 Apr 2025
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Xingjian Zhang
Siwei Wen
Wenjun Wu
Lei Huang
LRM
346
40
0
13 Apr 2025
Previous
123...10116789
Next