Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

31 May 2024
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
Renrui Zhang
Zihan Wang
Chenyu Zhou
Chunjiang Ge
Mengdan Zhang
Peixian Chen
Yanwei Li
Shaohui Lin
Zhengye Zhang
Ke Li
Tong Xu
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
VLM, MLLM

Papers citing "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"

50 / 550 papers shown
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Yunxin Li
Xinyu Chen
Shenyuan Jiang
Haoyuan Shi
Zhenyu Liu
...
Zhenran Xu
Yicheng Ma
Meishan Zhang
Baotian Hu
Min Zhang
MLLM, MoE, OSLM, VLM
617
1
0
16 Nov 2025
ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding
Yuan Zhou
Litao Hua
Shilong Jin
Wentao Huang
Haoran Duan
CML, VGen
252
0
0
16 Nov 2025
Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding
Arun Ramachandran
Ramaswamy Govindarajan
M. Annavaram
Prakash Raghavendra
Hossein Entezari Zarch
Lei Gao
Chaoyi Jiang
149
0
0
15 Nov 2025
CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models
Jingyao Li
Jingyun Wang
Molin Tan
Haochen Wang
Cilin Yan
Likun Shi
Jiayin Cai
Xiaolong Jiang
Yao Hu
VLM, LRM
213
0
0
15 Nov 2025
OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
Feng Chen
Yefei He
Shaoxuan He
Yuanyu He
Jing Liu
...
Zhaoyang Li
Jiyuan Zhang
Zhenbang Sun
Bohan Zhuang
Qi Wu
VLM
192
0
0
15 Nov 2025
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models
Siyou Li
Huanan Wu
Juexi Shao
Yinghao Ma
Yujian Gan
...
Lu Wang
Wengqing Wu
Le Zhang
Massimo Poesio
Juntao Yu
VLM
165
0
0
14 Nov 2025
Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning
Jialong Qin
Xin Zou
Di Lu
Yibo Yan
Xuming Hu
VLM
256
0
0
11 Nov 2025
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Z. Liang
D. Zhang
Huichi Zhou
Rui Huang
Bobo Li
...
Shengqiong Wu
X. Wang
Jiebo Luo
Lizi Liao
Hao Fei
VGen
204
0
0
11 Nov 2025
StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
Yilong Chen
Xiang Bai
Zhibin Wang
Chengyu Bai
Yuhan Dai
Ming Lu
Shanghang Zhang
144
2
0
10 Nov 2025
TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
Junwen Pan
Qizhe Zhang
Rui Zhang
Ming Lu
Xin Wan
Yuan Zhang
Chang Liu
Qi She
AI4TS
124
0
0
07 Nov 2025
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Ellis L Brown
Arijit Ray
Ranjay Krishna
Ross B. Girshick
Rob Fergus
Saining Xie
363
6
0
06 Nov 2025
Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang
J. Yang
Pinzhi Huang
Ellis L Brown
Zihao Yang
...
Daohan Lu
Rob Fergus
Yann LeCun
Li Fei-Fei
Saining Xie
178
17
0
06 Nov 2025
NVIDIA Nemotron Nano V2 VL
Nvidia
Amala Sanjay Deshmukh
Kateryna Chumachenko
Tuomas Rintamaki
Matthieu Le
...
Krzysztof Pawelec
Michael Evans
Katherine Luna
Jie Lou
Erick Galinkin
VLM
311
2
0
06 Nov 2025
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
Ellis L Brown
Jihan Yang
Shusheng Yang
Rob Fergus
Saining Xie
VLM
230
5
0
06 Nov 2025
Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond
Fan Zhang
Haoxuan Li
Shengju Qian
Xin Wang
Zheng Lian
...
Yuan Gao
Qiankun Li
Yefeng Zheng
Zhouchen Lin
Pheng-Ann Heng
LRM
139
0
0
01 Nov 2025
FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding
Janghoon Cho
Jungsoo Lee
Munawar Hayat
Kyuwoong Hwang
Fatih Porikli
Sungha Choi
84
0
0
31 Oct 2025
LongCat-Flash-Omni Technical Report
M-A-P Team
Bairui Wang
Bayan
Bin Xiao
Bo Zhang
...
Xin Pan
Xin Chen
Xiusong Sun
Xu Xiang
X. Xing
MLLM, VLM
590
5
0
31 Oct 2025
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu
Hailun Xu
Yang Luo
Yong Liu
Kanchan Sarkar
Zhenheng Yang
Yang You
159
0
0
31 Oct 2025
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Ziyu Guo
Xinyan Chen
Renrui Zhang
Ruichuan An
Yu Qi
Dongzhi Jiang
Xiangtai Li
M. Zhang
Jiaming Song
Pheng-Ann Heng
VGen, LRM
197
13
0
30 Oct 2025
EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Minjoon Jung
Junbin Xiao
Junghyun Kim
Byoung-Tak Zhang
Angela Yao
127
1
0
30 Oct 2025
Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media
Shakib Yazdani
Yasser Hamidullah
C. España-Bonet
Josef van Genabith
SLR
250
1
0
29 Oct 2025
Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
Ali Rasekh
Erfan Bagheri Soula
Omid Daliran
Simon Gottschalk
Mohsen Fayyaz
94
1
0
29 Oct 2025
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
Inclusion AI
Bowen Ma
Cheng Zou
C. Yan
Chunxiang Jin
...
Zhiqiang Fang
Zhihao Qiu
Ziyuan Huang
Zizheng Yang
Z. He
MLLM, MoE, VLM
350
2
0
28 Oct 2025
Revisiting Multimodal Positional Encoding in Vision-Language Models
Jie Huang
Xuejing Liu
Sibo Song
Ruibing Hou
Hong Chang
Junyang Lin
S. Bai
161
2
0
27 Oct 2025
EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
Baoqi Pei
Yifei Huang
Jilan Xu
Yuping He
Guo Chen
Fei Wu
Yu Qiao
Jiangmiao Pang
EgoV, LRM
215
4
0
27 Oct 2025
Positional Preservation Embedding for Multimodal Large Language Models
Mouxiao Huang
Borui Jiang
Dehua Zheng
Hailin Hu
Kai Han
Xinghao Chen
VLM
287
0
0
27 Oct 2025
MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection
Anisha Saha
Varsha Suresh
Timothy Hospedales
Vera Demberg
LRM
83
0
0
27 Oct 2025
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Kun Ouyang
Yuanxin Liu
Linli Yao
Yishuo Cai
Hao Zhou
Jie Zhou
Fandong Meng
Xu Sun
OffRL, LRM, ReLM
390
1
0
23 Oct 2025
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Jiahao Meng
X. Li
Haochen Wang
Yue Tan
Tao Zhang
...
Yunhai Tong
Anran Wang
Zhiyang Teng
Y. Wang
Z. Wang
VGen, LRM
334
6
0
23 Oct 2025
SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
Yuan Sheng
Y. Hao
Chenxu Li
Shuo Wang
Xiangnan He
96
0
0
23 Oct 2025
[De|Re]constructing VLMs' Reasoning in Counting
Simone Alghisi
Gabriel Roccabruna
Massimo Rizzoli
Seyed Mahed Mousavi
Giuseppe Riccardi
ReLM, LRM, VLM
205
1
0
22 Oct 2025
Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Su Ho Han
Jeongseok Hyun
Pilhyeon Lee
Minho Shim
Dongyoon Wee
Seon Joo Kim
VOS, VLM
241
0
0
22 Oct 2025
IF-VidCap: Can Video Caption Models Follow Instructions?
S. Li
Y. Zhang
J. Wu
Zhide Lei
Yiwen He
...
Yingshui Tan
Y. Wang
Qianqian Xie
Zhaoxiang Zhang
Jiaheng Liu
VLM
151
2
0
21 Oct 2025
StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Xueyi Chen
Keda Tao
Kele Shao
Huan Wang
198
2
0
21 Oct 2025
SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
Samir Khaki
Junxian Guo
Jiaming Tang
Shang Yang
Yukang Chen
Konstantinos N. Plataniotis
Yao Lu
Song Han
Zhijian Liu
MLLM, VLM
183
1
0
20 Oct 2025
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
Yaning Pan
Z. Wang
Qianqian Xie
Yongqian Wen
Y. Zhang
...
An Ping
Tianhao Peng
Jiaheng Liu
165
4
0
20 Oct 2025
Video Reasoning without Training
Deepak Sridhar
K. Bhardwaj
Jeya Pradha Jeyaraj
Nuno Vasconcelos
Ankita Nayak
Harris Teague
OffRL, LRM
192
1
0
19 Oct 2025
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
Jiaying Zhu
Yurui Zhu
Xin Lu
Wenrui Yan
Dong Li
Kunlin Liu
Xueyang Fu
Zheng-Jun Zha
MQ, VLM
254
0
0
18 Oct 2025
Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Xuchen Li
Xuzhao Li
Shiyu Hu
Kaiqi Huang
91
1
0
17 Oct 2025
VISTA: A Test-Time Self-Improving Video Generation Agent
Do Xuan Long
Xingchen Wan
Hootan Nakhost
Chen-Yu Lee
Tomas Pfister
Sercan Ö. Arık
VGen, TTA
250
3
0
17 Oct 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Hanrong Ye
Chao-Han Huck Yang
Arushi Goel
Wei Huang
Ligeng Zhu
...
Andrew Tao
Song Han
Jan Kautz
Hongxu Yin
Pavlo Molchanov
187
3
0
17 Oct 2025
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Xingrui Wang
Jiang Liu
Chao Huang
X. Yu
Ze Wang
Ximeng Sun
Jialian Wu
Alan Yuille
Emad Barsoum
Zicheng Liu
VLM
101
0
0
16 Oct 2025
MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning
Xukai Wang
Xuanbo Liu
Mingrui Chen
Haitian Zhong
Xuanlin Yang
...
Xu-Yao Zhang
Qiang Liu
Zhouchen Lin
Wentao Zhang
Bin Dong
ELM, LRM
164
1
0
16 Oct 2025
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
Jinglei Zhang
Yuanfan Guo
Rolandos Alexandros Potamias
Jiankang Deng
Hang Xu
Chao Ma
LRM
124
2
0
16 Oct 2025
Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
Xiaoqian Shen
Wenxuan Zhang
Jun-Cheng Chen
Mohamed Elhoseiny
VLM, LRM
114
5
0
15 Oct 2025
MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models
Keyan Zhou
Zecheng Tang
Lingfeng Ming
G. Zhou
Qiguang Chen
...
Zheming Yang
Libo Qin
Minghui Qiu
Juntao Li
Min Zhang
107
0
0
15 Oct 2025
Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception
Ziyang Ma
Ruiyang Xu
Zhenghao Xing
Yunfei Chu
Yuping Wang
...
Pheng-Ann Heng
Kai Yu
Junyang Lin
Eng Siong Chng
Xie Chen
VLM
90
2
0
14 Oct 2025
An Empirical Study for Representations of Videos in Video Question Answering via MLLMs
Zhi Li
Yanan Wang
Hao Niu
Julio Vizcarra
Masato Taya
88
0
0
14 Oct 2025
K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding
Yifeng Yao
Yike Yun
Jing Wang
Huishuai Zhang
Dongyan Zhao
Ke Tian
Zhihao Wang
Minghui Qiu
Tao Wang
CLIP, VGen
136
1
0
14 Oct 2025
MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
Zhenxin Lei
Zhangwei Gao
Changyao Tian
Erfei Cui
Guanzhou Chen
...
Xiangyu Zhao
Jiayi Ji
Yu Qiao
Wenhai Wang
Gen Luo
VLM
248
0
0
14 Oct 2025