2405.21075
Cited By
v1
v2
v3 (latest)
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
31 May 2024
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
Renrui Zhang
Zihan Wang
Chenyu Zhou
Chunjiang Ge
Mengdan Zhang
Peixian Chen
Yanwei Li
Shaohui Lin
Zhengye Zhang
Ke Li
Tong Xu
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
VLM
MLLM
ArXiv (abs)
PDF
HTML
HuggingFace (25 upvotes)
Papers citing
"Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"
50 / 550 papers shown
Title
Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment
Kai-Po Chang
Wei-Yuan Cheng
Chi-Pin Huang
Fu-En Yang
Yu-Jie Wang
204
0
0
04 Dec 2025
StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios
Y. X. R. Wang
Zhenkai Li
Tianwen Qian
Huanran Zheng
Zheng Wang
Yuqian Fu
Xiaoling Wang
4
0
0
04 Dec 2025
SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
Chang-Hsun Wu
Kai-Po Chang
Yu-Yang Sheng
Hung-Kai Chung
Kuei-Chun Wang
Yu-Jie Wang
MLLM
210
0
0
04 Dec 2025
PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement
Yu-Wei Zhan
Xin Wang
Hong Chen
Tongtong Feng
Wei Feng
Ren Wang
Guangyao Li
Qing Li
Wenwu Zhu
VGen
243
0
0
04 Dec 2025
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
J. Li
Bin Li
Jiahao Li
Yan Lu
64
0
0
03 Dec 2025
EEA: Exploration-Exploitation Agent for Long Video Understanding
Te Yang
Xiangyu Zhu
Bo Wang
Quan Chen
Peng Jiang
Zhen Lei
56
0
0
03 Dec 2025
UniComp: Rethinking Video Compression Through Informational Uniqueness
Chao Yuan
Shimin Chen
Minliang Lin
Limeng Qiao
Guanglu Wan
Lin Ma
148
0
0
03 Dec 2025
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
Tao Wu
Li Yang
Gen Zhan
Y. Zhang
Yiting Liao
Junlin Li
Deliang Fu
Li Zhang
Limin Wang
AI4TS
VLM
LRM
187
0
0
03 Dec 2025
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
Xiaolong Li
Youping Gu
Xi Lin
Weijie Wang
Bohan Zhuang
72
0
0
03 Dec 2025
OneThinker: All-in-one Reasoning Model for Image and Video
Kaituo Feng
M. Zhang
Hongyu Li
Kaixuan Fan
Shuang Chen
...
Haoze Sun
Yan Feng
Peng Pei
Xunliang Cai
Xiangyu Yue
OffRL
MLLM
VLM
LRM
635
3
0
02 Dec 2025
MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm
Wei Chen
Chaoqun Du
Feng Gu
Wei He
Qizhen Li
...
Pengfei Yu
Y. Zheng
Chunpeng Zhou
Pan Zhou
Xuhan Zhu
MLLM
OffRL
VLM
621
1
0
02 Dec 2025
MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation
Youxin Pang
Jiajun Liu
L. Tan
Yong Zhang
Feng Gao
Xiang Deng
Zhuoliang Kang
Xiaoming Wei
Y. Liu
VGen
87
0
0
02 Dec 2025
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Woongyeong Yeo
Kangsan Kim
Jaehong Yoon
Sung Ju Hwang
LLMAG
VGen
VLM
336
0
0
02 Dec 2025
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Le Thien Phuc Nguyen
Zhuoran Yu
Samuel Low Yu Hang
Subin An
J. Lee
...
SeungEun Chung
Thanh-Huy Nguyen
JuWan Maeng
Soochahn Lee
Yong Jae Lee
AuLLM
VLM
194
0
0
01 Dec 2025
Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models
Zhongyu Yang
Dannong Xu
Wei Pang
Yingfang Yuan
VLM
180
0
0
01 Dec 2025
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Zhiheng Liu
Weiming Ren
Haozhe Liu
Zijian Zhou
S. Chen
...
Ping Luo
Wei Liu
Tao Xiang
Jonas Schult
Yuren Cong
120
0
0
01 Dec 2025
ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models
Xusen Hei
Jiali Chen
Jinyu Yang
Mengchen Zhao
Yi Cai
LRM
112
0
0
01 Dec 2025
PAI-Bench: A Comprehensive Benchmark For Physical AI
Fengzhe Zhou
Jiannan Huang
Jialuo Li
Deva Ramanan
Humphrey Shi
VGen
148
0
0
01 Dec 2025
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
Yiyu Wang
Xuyang Liu
Xiyan Gui
Xinying Lin
B. Yang
Chenfei Liao
Tailai Chen
Linfeng Zhang
48
0
0
30 Nov 2025
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
Pengfei Hu
Meng Cao
Y. Wang
Yi Wang
Jiahua Dong
Jun Song
Yu Cheng
Bo Zheng
Xiaodan Liang
LRM
VLM
117
0
0
30 Nov 2025
REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories
Jacob Thompson
Emiliano Garcia-Lopez
Yonatan Bisk
LRM
106
0
0
30 Nov 2025
Video-CoM: Interactive Video Reasoning via Chain of Manipulations
H. Rasheed
Mohammed Zumri
Muhammad Maaz
Ming-Hsuan Yang
Fahad Shahbaz Khan
Salman Khan
AI4TS
LRM
141
0
0
28 Nov 2025
A Rosetta Stone for AI Benchmarks
A. Ho
Jean-Stanislas Denain
David Atanasov
Samuel Albanie
Rohin Shah
ELM
248
0
0
28 Nov 2025
Qwen3-VL Technical Report
Shuai Bai
Yuxuan Cai
Ruizhe Chen
Keqin Chen
Xionghui Chen
...
Jingren Zhou
F. I. S. Kevin Zhou
J. Zhou
Yuanzhi Zhu
Ke Zhu
VLM
1.4K
44
0
26 Nov 2025
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Peiran Xu
Sudong Wang
Yao Zhu
Jianing Li
Yunjian Zhang
LRM
330
1
0
26 Nov 2025
WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving
Seungjun Yu
Seonho Lee
Namho Kim
Jaeyo Shin
J. Park
Wonjeong Ryu
Raehyuk Jung
Hyunjung Shim
LRM
218
0
0
25 Nov 2025
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Zuhao Yang
Sudong Wang
Kaichen Zhang
Keming Wu
Sicong Leng
...
Bo Li
Chengwei Qin
Shijian Lu
X. Li
Lidong Bing
LRM
VLM
126
4
0
25 Nov 2025
Vision-Language Memory for Spatial Reasoning
Zuntao Liu
Yi Du
Taimeng Fu
Shaoshu Su
Cherie Ho
Chen Wang
VLM
LRM
229
0
0
25 Nov 2025
Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks
Bianka Kowalska
Halina Kwaśnicka
147
0
0
24 Nov 2025
Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents
Dayong Liu
Chao Xu
Weihong Chen
Suyu Zhang
Juncheng Wang
Jiankang Deng
Baigui Sun
Yang Liu
LM&Ro
257
0
0
24 Nov 2025
Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
James Y. Huang
Sheng Zhang
Qianchu Liu
Guanghui Qin
Tinghui Zhu
Tristan Naumann
Muhao Chen
Hoifung Poon
VLM
LRM
133
0
0
24 Nov 2025
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
Boyu Chen
Zikang Wang
Zhengrong Yue
Kainan Yan
Chenyun Yu
...
Yafei Wen
Xiaoxin Chen
Yang Liu
Peng Li
Yali Wang
LLMAG
304
3
0
24 Nov 2025
Vidi2: Large Multimodal Models for Video Understanding and Creation
Vidi Team
Celong Liu
Chia-Wen Kuo
Chuang Huang
Dawei Du
...
Wen Zhong
Xiaohui Shen
Xin Gu
Zhenfang Chen
Zuhua Lin
60
0
0
24 Nov 2025
OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs
Yuting Gao
Weihao Chen
L. xilinx Wang
Ruihan Xu
Q. Guo
MoE
116
0
0
24 Nov 2025
VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models
Fufangchen Zhao
Liao Zhang
Daiqi Shi
Yuanjun Gao
Chen Ye
Yang Cai
Jian Gao
Danfeng Yan
VLM
133
0
0
24 Nov 2025
EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning
Yogesh Kulkarni
Pooyan Fazli
EgoV
LRM
349
0
0
23 Nov 2025
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Xiyang Wu
Zongxia Li
Jihui Jin
Guangyao Shi
Gouthaman KV
Vishnu Raj
Nilotpal Sinha
Jingxi Chen
Fan Du
Dinesh Manocha
124
0
0
23 Nov 2025
AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Expert
Yuting Gao
Wang Lan
Hengyuan Zhao
Linjiang Huang
Si Liu
Q. Guo
MoE
164
0
0
23 Nov 2025
ChineseVideoBench: Benchmarking Multi-modal Large Models for Chinese Video Question Answering
Yuxiang Nie
Han Wang
Yongjie Ye
Haiyang Yu
Weitao Jia
...
Zehui Dai
Jiacong Wang
Dingkang Yang
An-Lan Wang
Can Huang
ELM
100
0
0
23 Nov 2025
EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs
Shaoyu Liu
Jianing Li
Guanghui Zhao
Y. Zhang
Xiangyang Ji
69
0
0
23 Nov 2025
Test-Time Temporal Sampling for Efficient MLLM Video Understanding
Kaibin Wang
Mingbao Lin
96
0
0
22 Nov 2025
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Y. Tang
Daiki Shimada
Hang Hua
Chao Huang
Jing Bi
Rogerio Feris
Chenliang Xu
225
0
0
21 Nov 2025
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Boshen Xu
Zihan Xiao
Jiaze Li
Jianzhong Ju
Zhenbo Luo
Jian Luan
Qin Jin
Mamba
515
0
0
20 Nov 2025
VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning
Zishan Xu
Yifu Guo
Y. Lu
Fengyu Yang
J. Li
VOS
216
1
0
20 Nov 2025
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Duo Li
Zuhao Yang
Xiaoqin Zhang
Ling Shao
Shijian Lu
VLM
146
1
0
19 Nov 2025
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Yushi Huang
Z. Wang
Zhihang Yuan
Yifu Ding
Ruihao Gong
Jinyang Guo
Xianglong Liu
Jun Zhang
MoE
VLM
240
1
0
19 Nov 2025
Multimodal Evaluation of Russian-language Architectures
Artem Chervyakov
Ulyana Isaeva
Anton A. Emelyanov
Artem Safin
Maria Tikhonova
...
Ilseyar Alimova
A. Kapitanov
Alena Fenogenova
290
1
0
19 Nov 2025
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Keda Tao
Kele Shao
Bohan Yu
Weiqiang Wang
Jian Liu
Huan Wang
VLM
241
2
0
18 Nov 2025
Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems
Jeffrey Wen
Rizwan Ahmad
Philip Schniter
MedIm
323
0
0
17 Nov 2025
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Jiaze Li
Hao Yin
Wenhui Tan
Jingyang Chen
Boshen Xu
Yuxun Qu
Yijing Chen
Jianzhong Ju
Zhenbo Luo
Jian Luan
LRM
VLM
230
1
0
17 Nov 2025