
Can LLMs Maintain Fundamental Abilities under KV Cache Compression?

Abstract

This paper investigates an under-explored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. While existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive empirical study evaluating prominent KV cache compression methods across diverse tasks, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation. Our analysis reveals that KV cache compression methods exhibit task-specific performance degradation. Arithmetic reasoning tasks prove particularly sensitive to aggressive compression, with different methods showing performance drops of 17.4%-43.3%. Notably, the DeepSeek R1 Distill model exhibits more robust compression tolerance compared to instruction-tuned models, showing only 9.67%-25.53% performance degradation. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that distinctly handles the prefill and decoding phases while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves 9%-18% performance improvements on long-context generation tasks under aggressive compression ratios.
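To make the idea concrete, the sketch below illustrates the general flavor of phase-aware, shot-level KV cache eviction described in the abstract: prefill tokens are grouped by the few-shot example ("shot") they belong to and kept or evicted as whole units, while decoded tokens are ranked individually. This is a minimal illustration under assumed inputs (accumulated attention scores, shot ids, and separate budgets), not the paper's actual ShotKV algorithm; all function and variable names here are hypothetical.

```python
import numpy as np

def compress_kv_cache(attn_scores, shot_ids, prefill_len,
                      prefill_budget, decode_budget):
    """Return indices of cached tokens to keep after compression.

    attn_scores : (seq_len,) accumulated attention each cached token received
    shot_ids    : (prefill_len,) id of the few-shot example ("shot") that each
                  prefill token belongs to; decode tokens are scored individually
    """
    keep = []

    # Prefill phase: rank whole shots by their mean attention score and
    # keep them atomically, so a retained shot stays semantically intact.
    shot_scores = {}
    for pos in range(prefill_len):
        shot_scores.setdefault(shot_ids[pos], []).append(attn_scores[pos])
    ranked_shots = sorted(shot_scores, key=lambda s: -np.mean(shot_scores[s]))

    used = 0
    for shot in ranked_shots:
        members = [p for p in range(prefill_len) if shot_ids[p] == shot]
        if used + len(members) > prefill_budget:
            continue  # skip shots that would exceed the prefill budget
        keep.extend(members)
        used += len(members)

    # Decode phase: keep the individually highest-scoring decoded tokens.
    decode_pos = np.arange(prefill_len, len(attn_scores))
    top = decode_pos[np.argsort(-attn_scores[decode_pos])][:decode_budget]
    keep.extend(top.tolist())

    return sorted(keep)


if __name__ == "__main__":
    # Toy usage: a hypothetical 3-shot prompt (6 prefill tokens) plus
    # 3 decoded tokens, compressed to 4 prefill + 2 decode slots.
    scores = np.array([0.9, 0.8, 0.2, 0.1, 0.7, 0.6, 0.5, 0.05, 0.3])
    shots = [0, 0, 1, 1, 2, 2]
    print(compress_kv_cache(scores, shots, prefill_len=6,
                            prefill_budget=4, decode_budget=2))
    # -> [0, 1, 4, 5, 6, 8]: shots 0 and 2 kept whole, shot 1 evicted,
    #    and the two strongest decoded tokens retained.
```

The key design point this sketch is meant to convey is that prefill and decode tokens are budgeted separately, and eviction during prefill operates on whole shots rather than individual tokens, which is how shot-level semantic coherence can be preserved.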

@article{liu2025_2502.01941,
  title={Can LLMs Maintain Fundamental Abilities under KV Cache Compression?},
  author={Xiang Liu and Zhenheng Tang and Hong Chen and Peijie Dong and Zeyu Li and Xiuze Zhou and Bo Li and Xuming Hu and Xiaowen Chu},
  journal={arXiv preprint arXiv:2502.01941},
  year={2025}
}