PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs

Zijing Wang
Yongkang Liu
Mingyang Wang
Ercong Nie
Deyuan Chen
Zhengjie Zhao
Shi Feng
Daling Wang
Xiaocui Yang
Yifei Zhang
Hinrich Schütze
Main: 8 pages · Bibliography: 3 pages · Appendix: 5 pages
9 figures · 7 tables
Abstract

Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this textual reasoning capability, undermining multimodal performance. To address this issue, we propose a training-free framework to mitigate the degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in MLLMs: early-stage modality separation, mid-stage modality alignment, and late-stage modality degradation. By analyzing the behavior of MLLMs at these stages, we propose a plateau-guided model merging method that selectively injects base language model parameters into the MLLM. Experimental results with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our repository is at this https URL.
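To make the merging idea concrete, below is a minimal sketch of layer-wise parameter merging guided by a set of "plateau" layers. It assumes simple linear interpolation between the MLLM's language-model weights and the base language model's weights for the selected transformer blocks; the function names (`merge_plateau_layers`, `_layer_index`) and the choice of interpolation are illustrative assumptions, not the paper's exact merge rule or layer-selection procedure.

```python
# Hypothetical sketch: interpolate MLLM LM weights toward the base LM
# only for transformer layers identified as the mid-stage alignment plateau.
import torch


def _layer_index(param_name):
    # Parse the block index from keys like "model.layers.12.mlp.down_proj.weight".
    parts = param_name.split(".")
    for i, p in enumerate(parts):
        if p == "layers" and i + 1 < len(parts) and parts[i + 1].isdigit():
            return int(parts[i + 1])
    return None


def merge_plateau_layers(mllm_state, base_lm_state, plateau_layers, alpha=0.5):
    """Selectively inject base-LM parameters into the MLLM backbone.

    mllm_state / base_lm_state: state dicts with matching keys for the LM backbone.
    plateau_layers: set of layer indices assumed to come from the layer-wise
        vision-token-masking analysis (the mid-stage plateau).
    alpha: interpolation weight toward the base language model (assumed value).
    """
    merged = {}
    for name, w_mllm in mllm_state.items():
        idx = _layer_index(name)
        if idx is not None and idx in plateau_layers and name in base_lm_state:
            merged[name] = (1 - alpha) * w_mllm + alpha * base_lm_state[name]
        else:
            merged[name] = w_mllm.clone()
    return merged
```

In this sketch, only the plateau layers are merged while all other parameters (including vision and projector modules, which have no base-LM counterpart) are kept from the MLLM unchanged.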
