2405.21075
Cited By
v1
v2
v3 (latest)
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
31 May 2024
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
Renrui Zhang
Zihan Wang
Chenyu Zhou
Chunjiang Ge
Mengdan Zhang
Peixian Chen
Yanwei Li
Shaohui Lin
Zhengye Zhang
Ke Li
Tong Xu
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
VLM
MLLM
HuggingFace (25 upvotes)
Papers citing "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"
50 / 543 papers shown
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
Xiaolong Li
Youping Gu
Xi Lin
Weijie Wang
Bohan Zhuang
64
0
0
03 Dec 2025
UniComp: Rethinking Video Compression Through Informational Uniqueness
Chao Yuan
Shimin Chen
Minliang Lin
Limeng Qiao
Guanglu Wan
Lin Ma
132
0
0
03 Dec 2025
EEA: Exploration-Exploitation Agent for Long Video Understanding
Te Yang
Xiangyu Zhu
Bo Wang
Quan Chen
Peng Jiang
Zhen Lei
32
0
0
03 Dec 2025
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
J. Li
Bin Li
Jiahao Li
Yan Lu
40
0
0
03 Dec 2025
OneThinker: All-in-one Reasoning Model for Image and Video
Kaituo Feng
M. Zhang
Hongyu Li
Kaixuan Fan
Shuang Chen
...
Haoze Sun
Yan Feng
Peng Pei
Xunliang Cai
Xiangyu Yue
OffRL
MLLM
VLM
LRM
610
3
0
02 Dec 2025
MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm
Wei Chen
Chaoqun Du
Feng Gu
Wei He
Qizhen Li
...
Pengfei Yu
Y. Zheng
Chunpeng Zhou
Pan Zhou
Xuhan Zhu
MLLM
OffRL
VLM
605
1
0
02 Dec 2025
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Woongyeong Yeo
Kangsan Kim
Jaehong Yoon
Sung Ju Hwang
LLMAG
VGen
VLM
324
0
0
02 Dec 2025
MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation
Youxin Pang
Jiajun Liu
L. Tan
Yong Zhang
Feng Gao
Xiang Deng
Zhuoliang Kang
Xiaoming Wei
Y. Liu
VGen
79
0
0
02 Dec 2025
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Zhiheng Liu
Weiming Ren
Haozhe Liu
Zijian Zhou
S. Chen
...
Ping Luo
Wei Liu
Tao Xiang
Jonas Schult
Yuren Cong
120
0
0
01 Dec 2025
PAI-Bench: A Comprehensive Benchmark For Physical AI
Fengzhe Zhou
Jiannan Huang
Jialuo Li
Deva Ramanan
Humphrey Shi
VGen
144
0
0
01 Dec 2025
Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models
Zhongyu Yang
Dannong Xu
Wei Pang
Yingfang Yuan
VLM
164
0
0
01 Dec 2025
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Le Thien Phuc Nguyen
Zhuoran Yu
Samuel Low Yu Hang
Subin An
J. Lee
...
SeungEun Chung
Thanh-Huy Nguyen
JuWan Maeng
Soochahn Lee
Yong Jae Lee
AuLLM
VLM
182
0
0
01 Dec 2025
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
Pengfei Hu
Meng Cao
Y. Wang
Yi Wang
Jiahua Dong
Jun Song
Yu Cheng
Bo Zheng
Xiaodan Liang
LRM
VLM
117
0
0
30 Nov 2025
REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories
Jacob Thompson
Emiliano Garcia-Lopez
Yonatan Bisk
LRM
98
0
0
30 Nov 2025
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
Yiyu Wang
Xuyang Liu
Xiyan Gui
Xinying Lin
B. Yang
Chenfei Liao
Tailai Chen
Linfeng Zhang
44
0
0
30 Nov 2025
Video-CoM: Interactive Video Reasoning via Chain of Manipulations
H. Rasheed
Mohammed Zumri
Muhammad Maaz
Ming-Hsuan Yang
Fahad Shahbaz Khan
Salman Khan
AI4TS
LRM
133
0
0
28 Nov 2025
A Rosetta Stone for AI Benchmarks
A. Ho
Jean-Stanislas Denain
David Atanasov
Samuel Albanie
Rohin Shah
ELM
228
0
0
28 Nov 2025
Qwen3-VL Technical Report
Shuai Bai
Yuxuan Cai
Ruizhe Chen
Keqin Chen
Xionghui Chen
...
Jingren Zhou
F. I. S. Kevin Zhou
J. Zhou
Yuanzhi Zhu
Ke Zhu
VLM
1.2K
39
0
26 Nov 2025
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Peiran Xu
Sudong Wang
Yao Zhu
Jianing Li
Yunjian Zhang
LRM
326
0
0
26 Nov 2025
Vision-Language Memory for Spatial Reasoning
Zuntao Liu
Yi Du
Taimeng Fu
Shaoshu Su
Cherie Ho
Chen Wang
VLM
LRM
221
0
0
25 Nov 2025
WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving
Seungjun Yu
Seonho Lee
Namho Kim
Jaeyo Shin
J. Park
Wonjeong Ryu
Raehyuk Jung
Hyunjung Shim
LRM
218
0
0
25 Nov 2025
VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models
Fufangchen Zhao
Liao Zhang
Daiqi Shi
Yuanjun Gao
Chen Ye
Yang Cai
Jian Gao
Danfeng Yan
VLM
129
0
0
24 Nov 2025
Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
James Y. Huang
Sheng Zhang
Qianchu Liu
Guanghui Qin
Tinghui Zhu
Tristan Naumann
Muhao Chen
Hoifung Poon
VLM
LRM
133
0
0
24 Nov 2025
OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs
Yuting Gao
Weihao Chen
L. xilinx Wang
Ruihan Xu
Q. Guo
MoE
108
0
0
24 Nov 2025
Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents
Dayong Liu
Chao Xu
Weihong Chen
Suyu Zhang
Juncheng Wang
Jiankang Deng
Baigui Sun
Yang Liu
LM&Ro
253
0
0
24 Nov 2025
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
Boyu Chen
Zikang Wang
Zhengrong Yue
Kainan Yan
Chenyun Yu
...
Yafei Wen
Xiaoxin Chen
Yang Liu
Peng Li
Yali Wang
LLMAG
300
3
0
24 Nov 2025
Vidi2: Large Multimodal Models for Video Understanding and Creation
Vidi Team
Celong Liu
Chia-Wen Kuo
Chuang Huang
Dawei Du
...
Wen Zhong
Xiaohui Shen
Xin Gu
Zhenfang Chen
Zuhua Lin
60
0
0
24 Nov 2025
Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks
Bianka Kowalska
Halina Kwaśnicka
147
0
0
24 Nov 2025
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Xiyang Wu
Zongxia Li
Jihui Jin
Guangyao Shi
Gouthaman KV
Vishnu Raj
Nilotpal Sinha
Jingxi Chen
Fan Du
Dinesh Manocha
124
0
0
23 Nov 2025
AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Expert
Yuting Gao
Wang Lan
Hengyuan Zhao
Linjiang Huang
Si Liu
Q. Guo
MoE
160
0
0
23 Nov 2025
EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning
Yogesh Kulkarni
Pooyan Fazli
EgoV
LRM
325
0
0
23 Nov 2025
ChineseVideoBench: Benchmarking Multi-modal Large Models for Chinese Video Question Answering
Yuxiang Nie
Han Wang
Yongjie Ye
Haiyang Yu
Weitao Jia
...
Zehui Dai
Jiacong Wang
Dingkang Yang
An-Lan Wang
Can Huang
ELM
96
0
0
23 Nov 2025
EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs
Shaoyu Liu
Jianing Li
Guanghui Zhao
Y. Zhang
Xiangyang Ji
69
0
0
23 Nov 2025
Test-Time Temporal Sampling for Efficient MLLM Video Understanding
Kaibin Wang
Mingbao Lin
96
0
0
22 Nov 2025
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Y. Tang
Daiki Shimada
Hang Hua
Chao Huang
Jing Bi
Rogerio Feris
Chenliang Xu
225
0
0
21 Nov 2025
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Boshen Xu
Zihan Xiao
Jiaze Li
Jianzhong Ju
Zhenbo Luo
Jian Luan
Qin Jin
Mamba
507
0
0
20 Nov 2025
VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning
Zishan Xu
Yifu Guo
Y. Lu
Fengyu Yang
J. Li
VOS
208
1
0
20 Nov 2025
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Duo Li
Zuhao Yang
Xiaoqin Zhang
Ling Shao
Shijian Lu
VLM
146
1
0
19 Nov 2025
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Yushi Huang
Z. Wang
Zhihang Yuan
Yifu Ding
Ruihao Gong
Jinyang Guo
Xianglong Liu
Jun Zhang
MoE
VLM
232
1
0
19 Nov 2025
Multimodal Evaluation of Russian-language Architectures
Artem Chervyakov
Ulyana Isaeva
Anton A. Emelyanov
Artem Safin
Maria Tikhonova
...
Ilseyar Alimova
A. Kapitanov
Alena Fenogenova
274
1
0
19 Nov 2025
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Keda Tao
Kele Shao
Bohan Yu
Weiqiang Wang
Jian Liu
Huan Wang
VLM
237
2
0
18 Nov 2025
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Jiaze Li
Hao Yin
Wenhui Tan
Jingyang Chen
Boshen Xu
Yuxun Qu
Yijing Chen
Jianzhong Ju
Zhenbo Luo
Jian Luan
LRM
VLM
226
1
0
17 Nov 2025
Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems
Jeffrey Wen
Rizwan Ahmad
Philip Schniter
MedIm
323
0
0
17 Nov 2025
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Yunxin Li
Xinyu Chen
Shenyuan Jiang
Haoyuan Shi
Zhenyu Liu
...
Zhenran Xu
Yicheng Ma
Meishan Zhang
Baotian Hu
Min Zhang
MLLM
MoE
OSLM
VLM
575
1
0
16 Nov 2025
ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding
Yuan Zhou
Litao Hua
Shilong Jin
Wentao Huang
Haoran Duan
CML
VGen
213
0
0
16 Nov 2025
CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models
Jingyao Li
Jingyun Wang
Molin Tan
Haochen Wang
Cilin Yan
Likun Shi
Jiayin Cai
Xiaolong Jiang
Yao Hu
VLM
LRM
144
0
0
15 Nov 2025
OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
Feng Chen
Yefei He
Shaoxuan He
Yuanyu He
Jing Liu
...
Zhaoyang Li
Jiyuan Zhang
Zhenbang Sun
Bohan Zhuang
Qi Wu
VLM
175
0
0
15 Nov 2025
Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding
Arun Ramachandran
Ramaswamy Govindarajan
M. Annavaram
Prakash Raghavendra
Hossein Entezari Zarch
Lei Gao
Chaoyi Jiang
124
0
0
15 Nov 2025
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models
Siyou Li
Huanan Wu
Juexi Shao
Yinghao Ma
Yujian Gan
...
Lu Wang
Wengqing Wu
Le Zhang
Massimo Poesio
Juntao Yu
VLM
144
0
0
14 Nov 2025
Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning
Jialong Qin
Xin Zou
Di Lu
Yibo Yan
Xuming Hu
VLM
242
0
0
11 Nov 2025