
VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

Hanxun Yu
Wentong Li
Xuan Qu
Song Wang
Junbo Chen
Jianke Zhu
Main: 12 pages
15 figures
Bibliography: 1 page
28 tables
Appendix: 17 pages
Abstract

Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration that integrates two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: this https URL.
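To make the two-stage idea concrete, the sketch below illustrates generic "select dominant tokens, then merge the rest under text guidance" compression. It is not the authors' implementation: the function name, the use of top-k saliency scores, and the similarity-weighted merging are assumptions for illustration only, as the abstract does not specify these details.

```python
# Minimal sketch (not the paper's code): prune vision tokens by saliency, then
# merge the pruned ones into kept tokens with weights from a text query.
# All names (compress_vision_tokens, global_scores, text_embed) are hypothetical.
import torch

def compress_vision_tokens(vision_tokens, global_scores, text_embed, keep_ratio=0.25):
    """vision_tokens: (N, D); global_scores: (N,) saliency per token;
    text_embed: (D,) pooled text query. Returns compressed tokens of shape (K, D)."""
    n_tokens, _ = vision_tokens.shape
    n_keep = max(1, int(n_tokens * keep_ratio))

    # 1) Keep the most salient ("dominant") tokens -- a stand-in for DVTS-style selection.
    keep_idx = torch.topk(global_scores, n_keep).indices
    drop_mask = torch.ones(n_tokens, dtype=torch.bool)
    drop_mask[keep_idx] = False
    kept, dropped = vision_tokens[keep_idx], vision_tokens[drop_mask]

    if dropped.numel() == 0:
        return kept

    # 2) Merge each dropped token into its nearest kept token, weighted by its
    #    similarity to the text query -- a stand-in for TGVC-style complementation.
    text_weight = torch.softmax(dropped @ text_embed, dim=0)   # (M,)
    assign = torch.argmax(dropped @ kept.T, dim=1)              # (M,)
    merged = kept.clone()
    for j in range(n_keep):
        members = assign == j
        if members.any():
            w = text_weight[members].unsqueeze(1)               # (m, 1)
            merged[j] = merged[j] + (w * dropped[members]).sum(dim=0)
    return merged
```

For example, with 576 vision tokens and keep_ratio=0.25, this sketch would return 144 tokens, each augmented with text-weighted contributions from the pruned tokens rather than discarding them outright.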
