
CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance across diverse applications. However, their computational overhead during deployment remains a critical bottleneck. While Key-Value (KV) caching effectively trades memory for computation to enhance inference efficiency, the growing memory footprint from extensive KV caches significantly reduces throughput and restricts prolonged deployment on memory-constrained GPU devices. To address this challenge, we propose CalibQuant, a simple yet highly effective visual quantization strategy that drastically reduces both memory and computational overhead. Specifically, CalibQuant introduces an extreme 1-bit quantization scheme, complemented by novel post-scaling and calibration techniques tailored to the intrinsic patterns of KV caches, thereby ensuring high efficiency without compromising model performance. Leveraging Triton for runtime optimization, we achieve a 10x throughput increase on InternVL models. Our method is designed to be plug-and-play, seamlessly integrating with various existing MLLMs without requiring architectural changes. Extensive experiments confirm that our approach significantly reduces memory usage while maintaining computational efficiency and preserving multimodal capabilities. Code is available at this https URL.
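
The abstract does not spell out the quantization details, so the following is only a minimal sketch of generic 1-bit min-max KV cache quantization for intuition, not the paper's CalibQuant scheme with post-scaling and calibration. The function names and tensor shapes are hypothetical.

```python
import torch

def one_bit_quantize(kv: torch.Tensor, dim: int = -1):
    """Illustrative 1-bit min-max quantization along `dim`.

    Each value is mapped to 0 or 1 depending on whether it lies in the
    upper half of the per-channel range; the per-channel min and max are
    stored so the cache can be dequantized back to an approximation.
    """
    kv_min = kv.amin(dim=dim, keepdim=True)
    kv_max = kv.amax(dim=dim, keepdim=True)
    scale = (kv_max - kv_min).clamp_min(1e-8)
    codes = ((kv - kv_min) / scale >= 0.5).to(torch.uint8)  # 1-bit codes
    return codes, kv_min, kv_max

def one_bit_dequantize(codes, kv_min, kv_max):
    """Map 1-bit codes back to the stored channel endpoints."""
    return torch.where(codes.bool(), kv_max, kv_min)

# Usage: quantize a dummy key cache of shape (batch, heads, seq_len, head_dim).
keys = torch.randn(1, 8, 1024, 64)
codes, kmin, kmax = one_bit_quantize(keys)
keys_hat = one_bit_dequantize(codes, kmin, kmax)
print((keys - keys_hat).abs().mean())  # reconstruction error of the sketch
```

In practice the 1-bit codes would be bit-packed for storage, and the paper's calibration and post-scaling steps (plus the Triton kernels mentioned above) are what recover accuracy and throughput at this extreme bit width.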

@article{han2025_2502.14882,
  title={CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs},
  author={Insu Han and Zeliang Zhang and Zhiyuan Wang and Yifan Zhu and Susan Liang and Jiani Liu and Haiting Lin and Mingjie Zhao and Chenliang Xu and Kun Wan and Wentian Zhao},
  journal={arXiv preprint arXiv:2502.14882},
  year={2025}
}