Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

31 May 2024

Papers citing "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis"

50 / 550 papers shown

VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

942

17 Mar 2025

Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos

Computer Vision and Pattern Recognition (CVPR), 2025

329

17 Mar 2025

ViSpeak: Visual Instruction Feedback in Streaming Videos

302

17 Mar 2025

Efficient Motion-Aware Video MLLM

Computer Vision and Pattern Recognition (CVPR), 2025

265

17 Mar 2025

Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

359

17 Mar 2025

NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

257

17 Mar 2025

Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

981

16 Mar 2025

AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

421

16 Mar 2025

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

318

14 Mar 2025

V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

337

14 Mar 2025

FastVID: Dynamic Density Pruning for Fast Video Large Language Models

410

14 Mar 2025

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

1.3K

13 Mar 2025

TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs

261

13 Mar 2025

TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention

999

13 Mar 2025

Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing

516

13 Mar 2025

Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

268

12 Mar 2025

CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games

...

344

12 Mar 2025

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

Computer Vision and Pattern Recognition (CVPR), 2025

1.0K

12 Mar 2025

Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment

Xiaowei Bi

Zheyuan Xu

359

12 Mar 2025

VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers

730

12 Mar 2025

Generative Frame Sampler for Long Video Understanding

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

290

12 Mar 2025

Memory-enhanced Retrieval Augmentation for Long Video Understanding

365

12 Mar 2025

EgoBlind: Towards Egocentric Visual Assistance for the Blind

503

11 Mar 2025

RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding

248

11 Mar 2025

ALLVB: All-in-One Long Video Understanding Benchmark

AAAI Conference on Artificial Intelligence (AAAI), 2025

391

10 Mar 2025

Video Action Differencing

International Conference on Learning Representations (ICLR), 2025

317

10 Mar 2025

StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

938

08 Mar 2025

UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

352

08 Mar 2025

CASP: Compression of Large Multimodal Models Based on Attention Sparsity

Computer Vision and Pattern Recognition (CVPR), 2025

263

07 Mar 2025

Unified Reward Model for Multimodal Understanding and Generation

397

07 Mar 2025

^2AT: Multimodal Jailbreak Defense via Dynamic Joint Optimization for Multimodal Large Language Models

447

05 Mar 2025

EgoLife: Towards Egocentric Life Assistant

Computer Vision and Pattern Recognition (CVPR), 2025

...

278

05 Mar 2025

HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization

Computer Vision and Pattern Recognition (CVPR), 2025

430

03 Mar 2025

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Abdelrahman Abouelenin

...

302

294

03 Mar 2025

Adaptive Keyframe Sampling for Long Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

268

28 Feb 2025

Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

...

600

26 Feb 2025

M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

...

632

26 Feb 2025

MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

International Conference on Learning Representations (ICLR), 2024

422

24 Feb 2025

MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

487

23 Feb 2025

Magma: A Foundation Model for Multimodal AI Agents

Computer Vision and Pattern Recognition (CVPR), 2025

...

371

18 Feb 2025

SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

285

18 Feb 2025

VRoPE: Rotary Position Embedding for Video Large Language Models

386

17 Feb 2025

video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model

323

17 Feb 2025

Unhackable Temporal Rewarding for Scalable Video MLLMs

...

286

17 Feb 2025

SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

International Conference on Learning Representations (ICLR), 2025

334

15 Feb 2025

CoS: Chain-of-Shot Prompting for Long Video Understanding

303

10 Feb 2025

LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models

Tzu-Tao Chang

Shivaram Venkataraman

1.3K

04 Feb 2025

VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

355

03 Feb 2025

Towards Robust Multimodal Large Language Models Against Jailbreak Attacks

341

02 Feb 2025

Baichuan-Omni-1.5 Technical Report

Tao Zhang

...

330

28 Jan 2025