MMBench: Is Your Multi-modal Model an All-around Player?

European Conference on Computer Vision (ECCV), 2024

12 July 2023

Conghui He

Ziwei Liu

Kai-xiang Chen

Dahua Lin

Papers citing "MMBench: Is Your Multi-modal Model an All-around Player?"

50 / 687 papers shown

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

...

473

26 May 2025

Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models

286

26 May 2025

Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

444

26 May 2025

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

396

26 May 2025

SATORI-R1: Incentivizing Multimodal Reasoning through Explicit Visual Anchoring

465

25 May 2025

Caption This, Reason That: VLMs Caught in the Middle

Zihan Weng

Lucas Gomez

Taylor Whittington Webb

P. Bashivan

VLM LRM

384

24 May 2025

Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning

...

298

24 May 2025

MLLMs are Deeply Affected by Modality Bias

...

332

24 May 2025

ToDRE: Effective Visual Token Pruning via Token Diversity and Task Relevance

502

24 May 2025

CAS-IQA: Teaching Vision-Language Models for Synthetic Angiography Quality Assessment

210

23 May 2025

Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

349

22 May 2025

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

440

22 May 2025

Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Computer Vision and Pattern Recognition (CVPR), 2025

...

303

22 May 2025

Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

295

22 May 2025

Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

365

22 May 2025

Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM

Penghao Wu

Lewei Lu

Ziwei Liu

285

21 May 2025

ModRWKV: Transformer Multimodality in Linear Time

234

20 May 2025

VoQA: Visual-only Question Answering

348

20 May 2025

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

OffRL ReLM AI4TS VLM LRM

333

20 May 2025

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

314

19 May 2025

SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

531

18 May 2025

Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning

334

17 May 2025

Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans

457

16 May 2025

Bias and Generalizability of Foundation Models across Datasets in Breast Mammography

International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025

339

14 May 2025

Visual Instruction Tuning with Chain of Region-of-Interest

282

11 May 2025

Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding

...

316

10 May 2025

SITE: towards Spatial Intelligence Thorough Evaluation

293

08 May 2025

Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding

567

08 May 2025

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

447

08 May 2025

FG-CLIP: Fine-Grained Visual and Textual Alignment

604

08 May 2025

Multi-Agent System for Comprehensive Soccer Understanding

387

06 May 2025

Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

...

585

05 May 2025

SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning

361

05 May 2025

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

...

1.2K

05 May 2025

GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

...

584

30 Apr 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

529

30 Apr 2025

Multimodal Language Models See Better When They Look Shallower

356

30 Apr 2025

Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception

Computer Vision and Pattern Recognition (CVPR), 2025

536

29 Apr 2025

VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning

525

28 Apr 2025

Anyprefer: An Agentic Framework for Preference Data Synthesis

International Conference on Learning Representations (ICLR), 2025

...

445

27 Apr 2025

DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models

North American Chapter of the Association for Computational Linguistics (NAACL), 2025

...

1.1K

25 Apr 2025

Sparsity Forcing: Reinforcing Token Sparsity of MLLMs

365

23 Apr 2025

DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

400

23 Apr 2025

Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

630

23 Apr 2025

Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

402

20 Apr 2025

Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models

...

475

16 Apr 2025

Benchmarking Vision Language Models on German Factual Data

Artificial Intelligence Applications and Innovations (AIAI), 2025

René Peinl

Vincent Tischler

CoGe

345

15 Apr 2025

TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

378

14 Apr 2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

...

672

829

14 Apr 2025

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

478

14 Apr 2025