Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption

25 July 2024 · arXiv 2407.18003
Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao

Papers citing "Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption"

38 papers shown

Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management
Xinjun Yang, Qingda Hu, Junru Li, Feifei Li, Yicong Zhu, ..., Jian Dai, Yang Kong, J. Zhang, Guoqiang Xu, Qiang Liu
25 Nov 2025

XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao
13 Oct 2025

LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
Wenbo Wu, Qingyi Si, Xiurui Pan, Y. Wang, Jie Zhang
13 Oct 2025

UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Shian Du, Menghan Xia, Chang-rui Liu, Quande Liu, Xintao Wang, Pengfei Wan, Xiangyang Ji
09 Oct 2025

The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
Alexander Fichtl, Jeremias Bohn, Josefin Kelber, Edoardo Mosca, Georg Groh
06 Oct 2025

Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Alessio Devoto, Maximilian Jeblick, Simon Jégou
01 Oct 2025

OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule
Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, K. Murugesan, Tejaswini Pedapati, Pin-Yu Chen
25 Sep 2025

Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data
Carlo Bono, Federico Belotti, Matteo Palmonari
24 Sep 2025

Attention Beyond Neighborhoods: Reviving Transformer for Graph Clustering
Xuanting Xie, Bingheng Li, Erlin Pan, Rui Hou, Wenyu Chen, Zhao Kang
18 Sep 2025

A Comprehensive Review of Reinforcement Learning for Autonomous Driving in the CARLA Simulator
Elahe Delavari, Feeza Khan Khanzada, Jaerock Kwon
10 Sep 2025

Adaptive KV-Cache Compression without Manually Setting Budget
Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang
03 Sep 2025

CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang
24 Aug 2025

CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, Wanxiang Che
22 Aug 2025

SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning (Remote Sensing (RS), 2025)
Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, E. Barsoum, Jun Zhao, Kang Liu
21 Aug 2025

TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, Xihui Liu
24 Jul 2025

MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference (Annual Meeting of the Association for Computational Linguistics (ACL), 2025)
Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv, Shengyu Zhang, Fan Wu, Fei Wu
06 Jun 2025

KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache
Fei Li, Song Liu, Weiguo Wu, Shiqiang Nie, Jinyu Wang
18 May 2025

KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhaoxiang Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang
14 Apr 2025

SD$^2$: Self-Distilled Sparse Drafters
Mike Lasby, Nish Sinnadurai, Valavan Manohararajah, Sean Lie, Yani Andrew Ioannou, Vithursan Thangarasa
10 Apr 2025

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri
01 Apr 2025

WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
Youhui Zuo, Sibo Wei, C. Zhang, Zhuorui Liu, Dawei Song
23 Mar 2025

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, ..., Andrew Wen, Shaochen Zhong, Hanjie Chen, Helen Zhou
20 Mar 2025

A Survey on Transformer Context Extension: Approaches and Evaluation
Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu
17 Mar 2025

X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum
14 Mar 2025

APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs (Annual Meeting of the Association for Computational Linguistics (ACL), 2025)
Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
17 Feb 2025

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
12 Feb 2025

TOPLOC: A Locality Sensitive Hashing Scheme for Trustless Verifiable Inference
Jack Min Ong, Matthew Di Ferrante, Aaron Pazdera, Ryan Garner, Sami Jaghouar, Manveer Basra, Max Ryabinin, Johannes Hagemann
27 Jan 2025

Taming Teacher Forcing for Masked Autoregressive Video Generation (Computer Vision and Pattern Recognition (CVPR), 2025)
Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, ..., Zheng Ge, Nan Duan, Xiangyu Zhang, L. Ni, H. Shum
21 Jan 2025

MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
Wenxuan Zeng, Ye Dong, Jinjin Zhou, Jin Tan, Tao Wei, Runsheng Wang, Meng Li
12 Jan 2025

Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format (International Symposium on High-Performance Computer Architecture (HPCA), 2024)
Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst
24 Nov 2024

An Evolved Universal Transformer Memory (International Conference on Learning Representations (ICLR), 2024)
Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang
17 Oct 2024

MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection (International Conference on Learning Representations (ICLR), 2024)
Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng
16 Oct 2024

Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu
02 Oct 2024

An overview of domain-specific foundation model: key technologies, applications and challenges (Science China Information Sciences (Sci. China Inf. Sci.), 2024)
Haolong Chen, Hanzhi Chen, Zijian Zhao, Kaifeng Han, Guangxu Zhu, Yichen Zhao, Ying Du, Wei Xu, Qingjiang Shi
06 Sep 2024

Multi-Turn Interactions for Text-to-SQL with Large Language Models
Guanming Xiong, Junwei Bao, Hongfei Jiang, Yang Song, Wen Zhao
09 Aug 2024

ThinK: Thinner Key Cache by Query-Driven Pruning (International Conference on Learning Representations (ICLR), 2024)
Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo
30 Jul 2024

Yi: Open Foundation Models by 01.AI
01.AI: Alex Young, Bei Chen, Chao Li, ..., Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
07 Mar 2024

Fast Transformer Decoding: One Write-Head is All You Need
Noam M. Shazeer
06 Nov 2019