
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

31 May 2024


Papers citing "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"

50 / 550 papers shown

Inference Compute-Optimal Video Vision Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

276

24 May 2025

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

200

23 May 2025

Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

641

23 May 2025

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

433

22 May 2025

CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

298

22 May 2025

From Evaluation to Defense: Advancing Safety in Video Large Language Models

198

22 May 2025

QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

235

22 May 2025

Clapper: Compact Learning and Video Representation in VLMs

220

21 May 2025

Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

247

20 May 2025

VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

262

20 May 2025

Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

172

20 May 2025

SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models

258

19 May 2025

Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models

Keunwoo Peter Yu

Joyce Chai

MLLM VLM

289

16 May 2025

Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation

464

10 May 2025

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

616

08 May 2025

SITE: towards Spatial Intelligence Thorough Evaluation

290

08 May 2025

R^3-VQA: "Read the Room" by Video Social Reasoning

284

07 May 2025

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

...

379

04 May 2025

Grounding Task Assistance with Multimodal Cues from a Single Demonstration

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Gabriel Sarch

Balasaravanan Thoravi Kumaravel

Sahithya Ravi

Vibhav Vineet

A. D. Wilson

944

02 May 2025

AVA: Towards Agentic Video Analytics with Vision Language Models

533

01 May 2025

MINERVA: Evaluating Complex Video Reasoning

...

339

01 May 2025

Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering

Yumeng Shi

Quanyu Long

Wenya Wang

303

30 Apr 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

521

30 Apr 2025

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

400

29 Apr 2025

Learning Streaming Video Representation via Multitask Training

503

28 Apr 2025

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks

...

554

26 Apr 2025

ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding

254

25 Apr 2025

VEU-Bench: Towards Comprehensive Understanding of Video Editing

Computer Vision and Pattern Recognition (CVPR), 2025

306

24 Apr 2025

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

...

279

24 Apr 2025

FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding

De-An Huang

Subhashree Radhakrishnan

Zhiding Yu

Jan Kautz

VGen VLM

436

24 Apr 2025

VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

291

23 Apr 2025

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

415

23 Apr 2025

Sparsity Forcing: Reinforcing Token Sparsity of MLLMs

361

23 Apr 2025

Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

...

OffRL AI4TS SyDa LRM VLM

533

23 Apr 2025

Vidi: Large Multimodal Models for Video Understanding and Editing

...

359

22 Apr 2025

MR. Video: "MapReduce" is the Principle for Long Video Understanding

Ziqi Pang

Yu-Xiong Wang

VLM

275

22 Apr 2025

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

...

352

22 Apr 2025

An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes

271

21 Apr 2025

Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

203

20 Apr 2025

Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

398

20 Apr 2025

VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment

Yogesh Kulkarni

Pooyan Fazli

516

18 Apr 2025

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

977

17 Apr 2025

Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya

Po-Yao (Bernie) Huang

...

Christoph Feichtenhofer

ObjD VOS

666

107

17 Apr 2025

Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization

Pritam Sarkar

Ali Etemad

328

16 Apr 2025

TerraMind: Large-Scale Generative Multimodality for Earth Observation

...

Alessandra Feliciotti

MLLM VLM

490

15 Apr 2025

Reimagining Urban Science: Scaling Causal Inference with Large Language Models

...

952

15 Apr 2025

Multimodal Long Video Modeling Based on Temporal Dynamic Context

495

14 Apr 2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

...

596

790

14 Apr 2025

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

...

380

14 Apr 2025

TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

346

13 Apr 2025