arXiv:2407.18003
Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
25 July 2024
Luohe Shi
Hongyi Zhang
Yao Yao
Zuchao Li
Hai Zhao
Papers citing "Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption" (38 papers shown)
Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management
Xinjun Yang
Qingda Hu
Junru Li
Feifei Li
Yicong Zhu
...
Jian Dai
Yang Kong
J. Zhang
Guoqiang Xu
Qiang Liu
99
1
0
25 Nov 2025
XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
Haoqi Yang
Yao Yao
Zuchao Li
Baoyuan Qi
Guoming Liu
Hai Zhao
MQ
127
1
0
13 Oct 2025
LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
Wenbo Wu
Qingyi Si
Xiurui Pan
Y. Wang
Jie Zhang
VLM
103
0
0
13 Oct 2025
UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Shian Du
Menghan Xia
Chang-rui Liu
Quande Liu
Xintao Wang
Pengfei Wan
Xiangyang Ji
VGen
SupR
275
0
0
09 Oct 2025
The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
Alexander Fichtl
Jeremias Bohn
Josefin Kelber
Edoardo Mosca
Georg Groh
132
0
0
06 Oct 2025
Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Alessio Devoto
Maximilian Jeblick
Simon Jégou
MQ
VLM
108
4
0
01 Oct 2025
OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule
Yuxuan Zhu
David H. Yang
Mohammad Mohammadi Amiri
K. Murugesan
Tejaswini Pedapati
Pin-Yu Chen
VLM
175
0
0
25 Sep 2025
Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data
Carlo Bono
Federico Belotti
Matteo Palmonari
130
0
0
24 Sep 2025
Attention Beyond Neighborhoods: Reviving Transformer for Graph Clustering
Xuanting Xie
Bingheng Li
Erlin Pan
Rui Hou
Wenyu Chen
Zhao Kang
GNN
212
0
0
18 Sep 2025
A Comprehensive Review of Reinforcement Learning for Autonomous Driving in the CARLA Simulator
Elahe Delavari
Feeza Khan Khanzada
Jaerock Kwon
145
3
0
10 Sep 2025
Adaptive KV-Cache Compression without Manually Setting Budget
Chenxia Tang
Jianchun Liu
Hongli Xu
Liusheng Huang
113
0
0
03 Sep 2025
CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
Zicong Tang
Ziyang Ma
Suqing Wang
Zuchao Li
Lefei Zhang
Hai Zhao
Yun Li
Qianren Wang
VLM
139
2
0
24 Aug 2025
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
Yixuan Wang
Haoyu Qiao
Lujun Li
Qingfu Zhu
Wanxiang Che
MQ
134
1
0
22 Aug 2025
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
Huanxuan Liao
Yixing Xu
Shizhu He
Guanchen Li
Xuanwu Yin
Dong Li
E. Barsoum
Jun Zhao
Kang Liu
157
1
0
21 Aug 2025
TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
Zhekai Chen
Ruihang Chu
Yukang Chen
Shiwei Zhang
Yujie Wei
Yingya Zhang
Xihui Liu
260
8
0
24 Jul 2025
MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Kunxi Li
Zhonghua Jiang
Zhouzhou Shen
Zhaode Wang
Chengfei Lv
Shengyu Zhang
Fan Wu
Fei Wu
VLM
206
2
0
06 Jun 2025
KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache
Fei Li
Song Liu
Weiguo Wu
Shiqiang Nie
Jinyu Wang
MQ
95
0
0
18 May 2025
KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
Yuxuan Tian
Zihan Wang
Yebo Peng
Aomufei Yuan
Zhaoxiang Wang
Bairen Yi
Xin Liu
Yong Cui
Tong Yang
374
0
0
14 Apr 2025
SD²: Self-Distilled Sparse Drafters
Mike Lasby
Nish Sinnadurai
Valavan Manohararajah
Sean Lie
Yani Andrew Ioannou
Vithursan Thangarasa
791
1
0
10 Apr 2025
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
Yuxuan Zhu
Ali Falahati
David H. Yang
Mohammad Mohammadi Amiri
318
1
0
01 Apr 2025
WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
Youhui Zuo
Sibo Wei
C. Zhang
Zhuorui Liu
Dawei Song
VLM
421
1
0
23 Mar 2025
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Yang Sui
Yu-Neng Chuang
Guanchu Wang
Jiamu Zhang
Tianyi Zhang
...
Andrew Wen
Shaochen Zhong
Hanjie Chen
Helen Zhou
OffRL
ReLM
LRM
758
273
0
20 Mar 2025
A Survey on Transformer Context Extension: Approaches and Evaluation
Yijun Liu
Jinzheng Yu
Yang Xu
Zhongyang Li
Qingfu Zhu
LLMAG
520
12
0
17 Mar 2025
X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
Guihong Li
Mehdi Rezagholizadeh
Mingyu Yang
Vikram Appia
Emad Barsoum
VLM
381
1
0
14 Mar 2025
APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yuxiang Huang
Mingye Li
Xu Han
Chaojun Xiao
Weilin Zhao
Sun Ao
Hao Zhou
Jie Zhou
Zhiyuan Liu
Maosong Sun
386
2
0
17 Feb 2025
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Konstantin Berestizshevsky
Renzo Andri
Lukas Cavigelli
421
2
0
12 Feb 2025
TOPLOC: A Locality Sensitive Hashing Scheme for Trustless Verifiable Inference
Jack Min Ong
Matthew Di Ferrante
Aaron Pazdera
Ryan Garner
Sami Jaghouar
Manveer Basra
Max Ryabinin
Johannes Hagemann
LRM
348
7
0
27 Jan 2025
Taming Teacher Forcing for Masked Autoregressive Video Generation
Computer Vision and Pattern Recognition (CVPR), 2025
Deyu Zhou
Quan Sun
Yuang Peng
Kun Yan
Runpei Dong
...
Zheng Ge
Nan Duan
Xiangyu Zhang
L. Ni
H. Shum
VGen
389
19
0
21 Jan 2025
MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
Wenxuan Zeng
Ye Dong
Jinjin Zhou
Jin Tan
Tao Wei
Runsheng Wang
Meng Li
339
1
0
12 Jan 2025
Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Chao Fang
Man Shi
Robin Geens
Arne Symons
Zhongfeng Wang
Marian Verhelst
417
11
0
24 Nov 2024
An Evolved Universal Transformer Memory
International Conference on Learning Representations (ICLR), 2025
Edoardo Cetin
Qi Sun
Tianyu Zhao
Yujin Tang
1.3K
4
0
17 Oct 2024
MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
International Conference on Learning Representations (ICLR), 2025
Bokai Lin
Zihao Zeng
Zipeng Xiao
Siqi Kou
Tianqi Hou
Xiaofeng Gao
Hao Zhang
Zhijie Deng
304
10
0
16 Oct 2024
Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Yuxiang Huang
Binhang Yuan
Xu Han
Chaojun Xiao
Zhiyuan Liu
RALM
472
11
0
02 Oct 2024
An overview of domain-specific foundation model: key technologies, applications and challenges
Science China Information Sciences (Sci. China Inf. Sci.), 2024
Haolong Chen
Hanzhi Chen
Zijian Zhao
Kaifeng Han
Guangxu Zhu
Yichen Zhao
Ying Du
Wei Xu
Qingjiang Shi
ALM
VLM
489
19
0
06 Sep 2024
Multi-Turn Interactions for Text-to-SQL with Large Language Models
Guanming Xiong
Junwei Bao
Hongfei Jiang
Yang Song
Wen Zhao
LRM
370
2
0
09 Aug 2024
ThinK: Thinner Key Cache by Query-Driven Pruning
International Conference on Learning Representations (ICLR), 2025
Yuhui Xu
Zhanming Jie
Hanze Dong
Lei Wang
Xudong Lu
Aojun Zhou
Amrita Saha
Caiming Xiong
Doyen Sahoo
533
41
0
30 Jul 2024
Yi: Open Foundation Models by 01.AI
01.AI
Alex Young
Bei Chen
Chao Li
...
Yue Wang
Yuxuan Cai
Zhenyu Gu
Zhiyuan Liu
Zonghong Dai
OSLM
LRM
840
768
0
07 Mar 2024
Fast Transformer Decoding: One Write-Head is All You Need
Noam M. Shazeer
599
641
0
06 Nov 2019