Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

31 May 2024

Papers citing "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"

50 / 550 papers shown

WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

118

26 Sep 2025

MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning

...

393

25 Sep 2025

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

213

25 Sep 2025

ConViS-Bench: Estimating Video Similarity Through Semantic Concepts

147

23 Sep 2025

Does Audio Matter for Modern Video-LLMs and Their Benchmarks?

Geewook Kim

Minjoon Seo

AuLLM

117

22 Sep 2025

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

381

22 Sep 2025

Qwen3-Omni Technical Report

...

215

22 Sep 2025

TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

Zhongyuan Bao

Lejun Zhang

LRM

259

19 Sep 2025

ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding

Kehua Chen

VGen

118

19 Sep 2025

Frame Sampling Strategies Matter: A Benchmark for small vision language models

115

18 Sep 2025

MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

...

127

17 Sep 2025

Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark

133

17 Sep 2025

AToken: A Unified Tokenizer for Vision

249

17 Sep 2025

Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning

201

16 Sep 2025

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

...

198

16 Sep 2025

Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

...

169

15 Sep 2025

FineQuest: Adaptive Knowledge-Assisted Sports Video Understanding via Agent-of-Thoughts Reasoning

175

15 Sep 2025

DATE: Dynamic Absolute Time Enhancement for Long Video Understanding

Chao Yuan

Y. Yang

Zach Cheng

11 Sep 2025

Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening

Piyush Bagad

Andrew Zisserman

AI4TS

243

10 Sep 2025

MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models

180

10 Sep 2025

AdsQA: Towards Advertisement Video Understanding

...

144

10 Sep 2025

Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs

142

09 Sep 2025

In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

138

09 Sep 2025

WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning

133

05 Sep 2025

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

120

03 Sep 2025

ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

Kimihiro Hasegawa

Wiradee Imrattanatrai

163

03 Sep 2025

Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

322

03 Sep 2025

VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality

Srihari Bandraupalli

Anupam Purwar

VLM

03 Sep 2025

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

259

02 Sep 2025

PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?

Mennatullah Siam

VGen

118

02 Sep 2025

Robix: A Unified Model for Robot Interaction, Reasoning and Planning

168

01 Sep 2025

Kwai Keye-VL 1.5 Technical Report

...

333

01 Sep 2025

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

01 Sep 2025

Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors

142

31 Aug 2025

LightVLM: Accelerating Large Multimodal Models with Pyramid Token Merging and KV Cache Compression

135

30 Aug 2025

VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding

119

30 Aug 2025

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

123

29 Aug 2025

Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding

109

28 Aug 2025

Video-LevelGauge: Investigating Contextual Positional Bias in Large Video Language Models

192

27 Aug 2025

CVBench: Benchmarking Cross-Video Synergies for Complex Multimodal Reasoning

...

247

27 Aug 2025

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

...

305

279

25 Aug 2025

Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing

Yogesh Kumar

VLM

104

25 Aug 2025

AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering

140

25 Aug 2025

Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms

128

22 Aug 2025

An Empirical Study on How Video-LLMs Answer Video Questions

152

21 Aug 2025

StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

124

21 Aug 2025

HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes

...

182

19 Aug 2025

Mitigating Easy Option Bias in Multiple-Choice Question Answering

132

19 Aug 2025

EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding

Ashish Seth

Utkarsh Tyagi

Ramaneswaran Selvakumar

229

18 Aug 2025

Ovis2.5 Technical Report

...

156

15 Aug 2025