Silent Hazards of Token Reduction in Vision-Language Models: The Hidden Impact on Consistency

9 March 2025
Yizheng Sun
Hao Li
Chang Xu
Chenghua Lin
Riza Batista-Navarro
Jingyuan Sun
Abstract

Vision language models (VLMs) have excelled in visual reasoning but often incur high computational costs. One key reason is the redundancy of visual tokens. Although recent token reduction methods claim to achieve minimal performance loss, our extensive experiments reveal that token reduction can substantially alter a model's output distribution, leading to changes in prediction patterns that standard metrics such as accuracy loss do not fully capture. Such inconsistencies are especially concerning for practical applications where system stability is critical. To investigate this phenomenon, we analyze how token reduction influences the energy distribution of a VLM's internal representations using a lower-rank approximation via Singular Value Decomposition (SVD). Our results show that changes in the Inverse Participation Ratio of the singular value spectrum are strongly correlated with the model's consistency after token reduction. Based on these insights, we propose LoFi, a training-free visual token reduction method that utilizes the leverage score from SVD for token pruning. Experimental evaluations demonstrate that LoFi not only reduces computational costs with minimal performance degradation but also significantly outperforms state-of-the-art methods in terms of output consistency.
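
The two quantities the abstract relies on, the Inverse Participation Ratio (IPR) of the singular value spectrum and SVD leverage scores for ranking visual tokens, can be illustrated with a short sketch. The snippet below is not the authors' implementation of LoFi; it assumes visual tokens form a (num_tokens x hidden_dim) matrix, treats the normalized squared singular values as the energy distribution, and uses an arbitrary rank k for the leverage scores.

# Illustrative sketch (not the paper's released code) of the IPR of a singular
# value spectrum and SVD leverage scores used to rank visual tokens for pruning.
import numpy as np

def ipr_of_spectrum(X: np.ndarray) -> float:
    """IPR of the singular value spectrum of X.

    Values near 1/r (r = rank) mean energy is spread over many directions;
    values near 1 mean energy concentrates in a single direction.
    """
    s = np.linalg.svd(X, compute_uv=False)
    p = s**2 / np.sum(s**2)          # normalized energy distribution (assumption)
    return float(np.sum(p**2))

def leverage_score_prune(X: np.ndarray, keep: int, k: int = 16) -> np.ndarray:
    """Return indices of the `keep` tokens with the highest SVD leverage scores.

    X: (num_tokens, hidden_dim) matrix of visual tokens.
    k: rank of the low-rank approximation used for the scores (assumed value).
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    k = min(k, U.shape[1])
    scores = np.sum(U[:, :k] ** 2, axis=1)        # leverage score of each token
    return np.sort(np.argsort(scores)[::-1][:keep])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((576, 1024))     # e.g. a 24x24 grid of visual tokens
    print("IPR:", ipr_of_spectrum(tokens))
    kept = leverage_score_prune(tokens, keep=144) # keep 25% of the tokens
    print("kept token indices:", kept.shape)

In this reading, pruning keeps the tokens whose rows contribute most to the top-k left singular subspace, which is one standard interpretation of "leverage score from SVD"; the paper's exact scoring and normalization may differ.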

@article{sun2025_2503.06794,
  title={Silent Hazards of Token Reduction in Vision-Language Models: The Hidden Impact on Consistency},
  author={Yizheng Sun and Hao Li and Chang Xu and Chenghua Lin and Riza Batista-Navarro and Jingyuan Sun},
  journal={arXiv preprint arXiv:2503.06794},
  year={2025}
}