MMBench: Is Your Multi-modal Model an All-around Player?

European Conference on Computer Vision (ECCV), 2024

12 July 2023

Conghui He

Ziwei Liu

Kai-xiang Chen

Dahua Lin

Papers citing "MMBench: Is Your Multi-modal Model an All-around Player?"

50 / 687 papers shown

Reinforcement Learning for Large Model: A Survey

323

24 Dec 2025

Multimodal Reinforcement Learning with Agentic Verifier for AI Agents

...

194

03 Dec 2025

Jina-VLM: Small Multilingual Vision Language Model

380

03 Dec 2025

ViDiC: Video Difference Captioning

173

03 Dec 2025

V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention

...

144

03 Dec 2025

MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

130

02 Dec 2025

Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

Julian Ma

Jun Wang

Zafeirios Fountas

02 Dec 2025

Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models

193

01 Dec 2025

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

...

201

01 Dec 2025

Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation

280

30 Nov 2025

REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories

Jacob Thompson

Emiliano Garcia-Lopez

Yonatan Bisk

LRM

137

30 Nov 2025

When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI

29 Nov 2025

Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction

Jiazhen Liu

Mingkuan Feng

Long Chen

29 Nov 2025

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

...

227

28 Nov 2025

Visual Generation Tuning

306

28 Nov 2025

WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world scenarios

...

27 Nov 2025

Unexplored flaws in multiple-choice VQA evaluations

27 Nov 2025

From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

27 Nov 2025

CaptionQA: Is Your Caption as Useful as the Image Itself?

205

26 Nov 2025

Object-Centric Vision Token Pruning for Vision Language Models

197

25 Nov 2025

HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

...

175

25 Nov 2025

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

167

24 Nov 2025

Robot-Powered Data Flywheels: Deploying Robots in the Wild for Continual Data Collection and Foundation Model Adaptation

343

24 Nov 2025

Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference

227

24 Nov 2025

Self-Empowering VLMs: Achieving Hierarchical Consistency via Self-Elicited Knowledge Distillation

137

23 Nov 2025

ConsistCompose: Unified Multimodal Layout Control for Image Composition

389

23 Nov 2025

AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Expert

168

23 Nov 2025

FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

187

22 Nov 2025

RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios

127

22 Nov 2025

VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

327

20 Nov 2025

Learning to Think Fast and Slow for Visual Language Models

226

20 Nov 2025

First Frame Is the Place to Go for Video Content Customization

207

19 Nov 2025

MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping

265

19 Nov 2025

A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

157

19 Nov 2025

Multimodal Evaluation of Russian-language Architectures

...

341

19 Nov 2025

When to Think and When to Look: Uncertainty-Guided Lookback

...

290

19 Nov 2025

FlexiCup: Wireless Multimodal Suction Cup with Dual-Zone Vision-Tactile Sensing

...

153

18 Nov 2025

Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

18 Nov 2025

CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product

...

130

17 Nov 2025

Explore How to Inject Beneficial Noise in MLLMs

211

17 Nov 2025

BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

165

16 Nov 2025

RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

181

16 Nov 2025

Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

...

625

16 Nov 2025

D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs

15 Nov 2025

TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models

171

14 Nov 2025

MACEval: A Multi-Agent Continual Evaluation Network for Large Models

235

12 Nov 2025

Learning with Preserving for Continual Multitask Learning

198

11 Nov 2025

RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation

Haofeng Wang

Yu Zhang

LRM

10 Nov 2025

Unveiling Modality Bias: Automated Sample-Specific Analysis for Multimodal Misinformation Benchmarks

232

08 Nov 2025

Visual Spatial Tuning

...

347

07 Nov 2025