
MMBench: Is Your Multi-modal Model an All-around Player?

European Conference on Computer Vision (ECCV), 2023

12 July 2023

Conghui He

Ziwei Liu

Kai-xiang Chen

Dahua Lin


Papers citing "MMBench: Is Your Multi-modal Model an All-around Player?"

50 / 687 papers shown

308

14 Mar 2025

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

328

14 Mar 2025

EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks

...

288

14 Mar 2025

Learning to Inference Adaptively for Multimodal Large Language Models

462

13 Mar 2025

Towards Understanding Graphical Perception in Large Multimodal Models

322

13 Mar 2025

DAVE: Diagnostic benchmark for Audio Visual Evaluation

271

12 Mar 2025

Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

268

12 Mar 2025

Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

470

11 Mar 2025

Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation

544

11 Mar 2025

Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru

Dunant Cusipuma

David Ortega

Victor Flores-Benites

Arturo Deza

OOD

306

10 Mar 2025

VisRL: Intention-Driven Visual Perception via Reinforced Reasoning

442

10 Mar 2025

Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning

Computer Vision and Pattern Recognition (CVPR), 2025

1.1K

10 Mar 2025

SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

719

09 Mar 2025

VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

219

09 Mar 2025

Statistical Study of Sensor Data and Investigation of ML-based Calibration Algorithms for Inexpensive Sensor Modules: Experiments from Cape Point

IEEE Transactions on Instrumentation and Measurement (IEEE Trans. Instrum. Meas.), 2025

Travis Barrett

Amit Kumar Mishra

329

09 Mar 2025

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

...

952

09 Mar 2025

SplatTalk: 3D VQA with Gaussian Splatting

598

08 Mar 2025

Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

499

07 Mar 2025

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Siyang Song

Mohammed Irfan Kurpath

Sahal Shaji Mullappilly

660

06 Mar 2025

ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task

199

06 Mar 2025

See What You Are Told: Visual Attention Sink in Large Multimodal Models

International Conference on Learning Representations (ICLR), 2025

380

05 Mar 2025

Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

Computer Vision and Pattern Recognition (CVPR), 2025

346

04 Mar 2025

DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Computer Vision and Pattern Recognition (CVPR), 2025

567

04 Mar 2025

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

361

04 Mar 2025

Towards Enhanced Image Generation Via Multi-modal Chain of Thought in Unified Generative Models

...

494

03 Mar 2025

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

459

03 Mar 2025

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Abdelrahman Abouelenin

...

302

294

03 Mar 2025

ABC: Achieving Better Control of Multimodal Embeddings using VLMs

Benjamin Schneider

Florian Kerschbaum

Wenhu Chen

975

01 Mar 2025

RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete

Computer Vision and Pattern Recognition (CVPR), 2025

...

496

28 Feb 2025

R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

493

27 Feb 2025

Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios

412

27 Feb 2025

M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

...

632

26 Feb 2025

Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference

443

25 Feb 2025

Introducing Visual Perception Token into Multimodal Large Language Model

334

24 Feb 2025

MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

282

24 Feb 2025

PosterSum: A Multimodal Benchmark for Scientific Poster Summarization

249

24 Feb 2025

Capability Instruction Tuning: A New Paradigm for Dynamic LLM Routing

AAAI Conference on Artificial Intelligence (AAAI), 2025

469

24 Feb 2025

RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

621

24 Feb 2025

Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models

Luca M. Schulze Buschoff

Konstantinos Voudouris

450

21 Feb 2025

TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba

370

21 Feb 2025

LOVA3: Learning to Visual Question Answering, Asking and Assessment

Neural Information Processing Systems (NeurIPS), 2024

417

21 Feb 2025

InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

534

20 Feb 2025

Megrez-Omni Technical Report

...

235

19 Feb 2025

InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models

183

19 Feb 2025

Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis

641

18 Feb 2025

Unhackable Temporal Rewarding for Scalable Video MLLMs

...

286

17 Feb 2025

Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent

...

346

17 Feb 2025

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

...

520

17 Feb 2025

Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models

388

17 Feb 2025

SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

International Conference on Learning Representations (ICLR), 2025

334

15 Feb 2025