
Evaluating and Steering Modality Preferences in Multimodal Large Language Models

Main: 1 page · Appendix: 27 pages · 18 figures · 19 tables
Abstract

Multi-modal large language models (MLLMs) have achieved remarkable success on complex multi-modal tasks. However, it remains insufficiently explored whether they exhibit modality preference, a tendency to favor one modality over another when processing multi-modal contexts. To study this question, we introduce the MC² benchmark, which constructs controlled evidence-conflict scenarios to systematically evaluate modality preference in decision-making. Extensive experiments reveal that all 20 tested MLLMs demonstrate clear modality preferences, and that these preferences can serve as a useful indicator of downstream task performance. Further analysis shows that modality preference can be controlled by instruction guidance and is captured in the latent representations of MLLMs. Building on these insights, we propose a probing-and-steering method based on representation engineering that explicitly controls modality preference without requiring additional fine-tuning. The method effectively amplifies modality preference toward a desired direction and yields promising improvements across multiple multi-modal understanding and reasoning tasks.
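The abstract does not specify the probing-and-steering procedure, but representation-engineering approaches of this kind typically train a linear probe on hidden states to find a preference direction and then shift activations along it at inference time. The sketch below illustrates that general recipe on synthetic data; the dimensions, the separation margin, and the use of a logistic-regression probe are illustrative assumptions, not details from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimensionality (assumption, not from the paper)

# Synthetic "hidden states": responses where a model followed the text
# modality vs. the image modality, separated along a hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
base = rng.normal(size=(200, d))
X = np.vstack([base + 2.0 * true_dir, base - 2.0 * true_dir])
y = np.array([1] * 200 + [0] * 200)  # 1 = text-preferring, 0 = image-preferring

# Probing: a linear classifier over hidden states recovers the
# modality-preference direction from its weight vector.
probe = LogisticRegression(max_iter=1000).fit(X, y)
steer = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Steering: shift a hidden state along the probe direction to amplify
# the preference one way or the other (no fine-tuning involved).
h = rng.normal(size=d)
alpha = 4.0  # steering strength (hyperparameter)
p_text = probe.predict_proba((h + alpha * steer)[None])[0, 1]
p_image = probe.predict_proba((h - alpha * steer)[None])[0, 1]
```

In an actual MLLM, `X` would be residual-stream activations collected at a chosen layer, and the shift `alpha * steer` would be added to those activations during the forward pass rather than scored by the probe.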
