Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

31 May 2024

Papers citing "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"

50 / 550 papers shown

Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

...

617

16 Nov 2025

ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding

252

16 Nov 2025

Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding

Arun Ramachandran

Ramaswamy Govindarajan

M. Annavaram

Prakash Raghavendra

Hossein Entezari Zarch

Lei Gao

Chaoyi Jiang

149

15 Nov 2025

CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models

213

15 Nov 2025

OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

...

192

15 Nov 2025

Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

...

165

14 Nov 2025

Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning

256

11 Nov 2025

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

...

204

11 Nov 2025

StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression

144

10 Nov 2025

TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning

124

07 Nov 2025

SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

363

06 Nov 2025

Cambrian-S: Towards Spatial Supersensing in Video

...

178

06 Nov 2025

NVIDIA Nemotron Nano V2 VL

Nvidia

Amala Sanjay Deshmukh

...

311

06 Nov 2025

Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

230

06 Nov 2025

Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond

...

139

01 Nov 2025

FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding

31 Oct 2025

LongCat-Flash-Omni Technical Report

...

590

31 Oct 2025

FOCUS: Efficient Keyframe Selection for Long Video Understanding

159

31 Oct 2025

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

197

30 Oct 2025

EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

127

30 Oct 2025

Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media

250

29 Oct 2025

Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

29 Oct 2025

Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

...

350

28 Oct 2025

Revisiting Multimodal Positional Encoding in Vision-Language Models

161

27 Oct 2025

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

215

27 Oct 2025

Positional Preservation Embedding for Multimodal Large Language Models

287

27 Oct 2025

MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection

27 Oct 2025

Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

390

23 Oct 2025

Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

...

334

23 Oct 2025

SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding

23 Oct 2025

[De|Re]constructing VLMs' Reasoning in Counting

205

22 Oct 2025

Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

241

22 Oct 2025

IF-VidCap: Can Video Caption Models Follow Instructions?

...

151

21 Oct 2025

StreamingTOM: Streaming Token Compression for Efficient Video Understanding

198

21 Oct 2025

SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Konstantinos N. Plataniotis

183

20 Oct 2025

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

...

Tianhao Peng

Jiaheng Liu

165

20 Oct 2025

Video Reasoning without Training

192

19 Oct 2025

VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

254

18 Oct 2025

Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning

17 Oct 2025

VISTA: A Test-Time Self-Improving Video Generation Agent

250

17 Oct 2025

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

...

187

17 Oct 2025

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

101

16 Oct 2025

MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning

...

164

16 Oct 2025

VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Jinglei Zhang

Yuanfan Guo

Rolandos Alexandros Potamias

124

16 Oct 2025

Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

114

15 Oct 2025

MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

...

107

15 Oct 2025

Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

...

14 Oct 2025

An Empirical Study for Representations of Videos in Video Question Answering via MLLMs

14 Oct 2025

K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding

136

14 Oct 2025

MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites

...

248

14 Oct 2025