KVCompose: Efficient Structured KV Cache Compression with Composite Tokens

Main: 8 pages · Appendix: 11 pages · Bibliography: 2 pages · 12 figures · 6 tables
Abstract
Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache compression methods either enforce rigid heuristics, disrupt tensor layouts with per-attention-head variability, or require specialized compute kernels.
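To make the scaling concrete, the sketch below estimates KV cache memory for a decoder-only transformer; it is not taken from the paper, and the example configuration (32 layers, 32 KV heads, head dimension 128, fp16) is an assumption chosen only to illustrate the linear growth in context length and depth.

```python
# Hypothetical sketch (not from the paper): KV cache memory grows linearly
# in both context length and the number of transformer layers.
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2 accounts for storing both keys and values,
    # per layer, per KV head, per cached token.
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Assumed 7B-class config at a 32k-token context: ~16 GiB per sequence in fp16.
print(kv_cache_bytes(32, 32, 128, 32_768) / 2**30)
```

Doubling either the context length or the number of layers doubles this footprint, which is why long-context inference makes the KV cache a dominant memory cost.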