KVCompose: Efficient Structured KV Cache Compression with Composite Tokens

Main: 8 pages · Appendix: 11 pages · Bibliography: 2 pages · 12 figures · 6 tables
Abstract
Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding; however, cache size grows linearly with context length and model depth, becoming a major bottleneck in long-context inference. Prior KV cache compression methods either enforce rigid heuristics, disrupt tensor layouts with per-attention-head variability, or require specialized compute kernels.
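To make the scaling concrete, the sketch below estimates KV cache memory for a decoder-only transformer; it is not taken from the paper, and the example configuration (32 layers, 32 KV heads, head dimension 128, fp16) is an assumption chosen only to illustrate the linear growth in context length and depth.

```python
# Hypothetical sketch (not from the paper): KV cache memory grows linearly
# in both context length and the number of transformer layers.
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2 accounts for storing both keys and values,
    # per layer, per KV head, per cached token.
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Assumed 7B-class config at a 32k-token context: ~16 GiB per sequence in fp16.
print(kv_cache_bytes(32, 32, 128, 32_768) / 2**30)
```

Doubling either the context length or the number of layers doubles this footprint, which is why long-context inference makes the KV cache a dominant memory cost.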