Inference-Time Hyper-Scaling with KV Cache Compression
arXiv:2506.05345 (v2, latest) · 5 June 2025
Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo Ponti
Links: arXiv (abs) · PDF · HTML · HuggingFace (27 upvotes) · GitHub (25,621★)
Papers citing "Inference-Time Hyper-Scaling with KV Cache Compression" (6 of 6 papers shown)
1. Attention and Compression is all you need for Controllably Efficient Language Models
   Jatin Prakash, A. Puli, Rajesh Ranganath · MQ, VLM · 434 / 0 / 0 · 07 Nov 2025
2. KV Cache Transform Coding for Compact Storage in LLM Inference
   Konrad Staniszewski, Adrian Łańcucki · VLM · 260 / 0 / 0 · 03 Nov 2025
3. Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
   Mutian He, Philip N. Garner · CLL · 232 / 0 / 0 · 23 Oct 2025
4. AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding
   Shuqing Luo, Yilin Guan, Pingzhi Li, Hanrui Wang, Tianlong Chen · 108 / 0 / 0 · 08 Oct 2025
5. On the Role of Temperature Sampling in Test-Time Scaling
   Yuheng Wu, Azalia Mirhoseini, Thierry Tambe · ALM, LRM · 89 / 1 / 1 · 02 Oct 2025
6. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
   Alessio Devoto, Maximilian Jeblick, Simon Jégou · MQ, VLM · 84 / 2 / 0 · 01 Oct 2025