
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

31 May 2024


Papers citing "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"

50 / 550 papers shown

VideoAds for Fast-Paced Video Understanding

289

12 Apr 2025

PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

Computer Vision and Pattern Recognition (CVPR), 2025

318

11 Apr 2025

Kimi-VL Technical Report

...

976

143

10 Apr 2025

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

316

10 Apr 2025

Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models

302

10 Apr 2025

SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

197

10 Apr 2025

LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding

219

09 Apr 2025

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

808

120

09 Apr 2025

PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning

1.1K

08 Apr 2025

SmolVLM: Redefining small and efficient multimodal models

...

463

117

07 Apr 2025

Advancing Egocentric Video Question Answering with Multimodal Large Language Models

Alkesh Patel

Vibhav Chitalia

Yinfei Yang

184

06 Apr 2025

MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

419

04 Apr 2025

SocialGesture: Delving into Multi-person Gesture Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

230

03 Apr 2025

Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

677

03 Apr 2025

TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding

305

02 Apr 2025

Slow-Fast Architecture for Video Multi-Modal Large Language Models

228

02 Apr 2025

GazeLLM: Multimodal LLMs incorporating Human Visual Attention

NASA/ESA Conference on Adaptive Hardware and Systems (AHS), 2025

Jun Rekimoto

213

31 Mar 2025

Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

297

31 Mar 2025

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

598

31 Mar 2025

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Computer Vision and Pattern Recognition (CVPR), 2025

270

31 Mar 2025

Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs

435

29 Mar 2025

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Computer Vision and Pattern Recognition (CVPR), 2025

331

29 Mar 2025

FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs

Computer Vision and Pattern Recognition (CVPR), 2025

330

27 Mar 2025

Video-R1: Reinforcing Video Reasoning in MLLMs

581

230

27 Mar 2025

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

292

27 Mar 2025

MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX

...

317

27 Mar 2025

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving

366

27 Mar 2025

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

316

26 Mar 2025

Qwen2.5-Omni Technical Report

...

1.2K

344

26 Mar 2025

FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Carlos Plou

Cesar Borja

Ruben Martinez-Cantin

Ana C. Murillo

339

25 Mar 2025

ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

316

25 Mar 2025

Audio-centric Video Understanding Benchmark without Text Shortcut

423

25 Mar 2025

PAVE: Patching and Adapting Video Large Language Models

Computer Vision and Pattern Recognition (CVPR), 2025

362

25 Mar 2025

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

425

24 Mar 2025

Breaking the Encoder Barrier for Seamless Video-Language Understanding

322

24 Mar 2025

LLaVAction: evaluating and training multi-modal large language models for action understanding

361

24 Mar 2025

Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models

...

200

24 Mar 2025

Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding

539

24 Mar 2025

GOAL: Global-local Object Alignment Learning

Computer Vision and Pattern Recognition (CVPR), 2025

918

22 Mar 2025

V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

343

22 Mar 2025

4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

...

275

22 Mar 2025

Judge Anything: MLLM as a Judge Across Any Modality

...

244

21 Mar 2025

What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?

Xuanming Cui

Jaiminkumar Ashokbhai Bhoi

Chionh Wei Peng

Adriel Kuek

Ser-Nam Lim

278

20 Mar 2025

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Computer Vision and Pattern Recognition (CVPR), 2025

236

20 Mar 2025

Agentic Keyframe Search for Video Question Answering

Sunqi Fan

Meng-Hao Guo

Shuojin Yang

216

20 Mar 2025

XAttention: Block Sparse Attention with Antidiagonal Scoring

336

20 Mar 2025

FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

356

19 Mar 2025

Improving LLM Video Understanding with 16 Frames Per Second

420

18 Mar 2025

317

18 Mar 2025

CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models

Computer Vision and Pattern Recognition (CVPR), 2025

260

18 Mar 2025