Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption

25 July 2024
Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao

Papers citing "Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption" (36 papers)

XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao
13 Oct 2025

LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
Wenbo Wu, Qingyi Si, Xiurui Pan, Y. Wang, Jie Zhang
13 Oct 2025

UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Shian Du, Menghan Xia, Chang-rui Liu, Quande Liu, Xintao Wang, Pengfei Wan, Xiangyang Ji
09 Oct 2025

The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
Alexander Fichtl, Jeremias Bohn, Josefin Kelber, Edoardo Mosca, Georg Groh
06 Oct 2025

Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Alessio Devoto, Maximilian Jeblick, Simon Jégou
01 Oct 2025

OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule
Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, K. Murugesan, Tejaswini Pedapati, Pin-Yu Chen
25 Sep 2025

Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data
Carlo Bono, Federico Belotti, Matteo Palmonari
24 Sep 2025

Attention Beyond Neighborhoods: Reviving Transformer for Graph Clustering
Xuanting Xie, Bingheng Li, Erlin Pan, Rui Hou, Wenyu Chen, Zhao Kang
18 Sep 2025

A Comprehensive Review of Reinforcement Learning for Autonomous Driving in the CARLA Simulator
Elahe Delavari, Feeza Khan Khanzada, Jaerock Kwon
10 Sep 2025

Adaptive KV-Cache Compression without Manually Setting Budget
Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang
03 Sep 2025

CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
Zicong Tang, Ziyang Ma, Suqing Wang, Zuchao Li, Lefei Zhang, Hai Zhao, Yun Li, Qianren Wang
24 Aug 2025

CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, Wanxiang Che
22 Aug 2025

SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning. Remote Sensing (RS), 2025.
Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, E. Barsoum, Jun Zhao, Kang Liu
21 Aug 2025

TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, Xihui Liu
24 Jul 2025

MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference. Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv, Shengyu Zhang, Fan Wu, Fei Wu
06 Jun 2025

KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhaoxiang Wang, Bairen Yi, Xin Liu, Yong Cui, Tong Yang
14 Apr 2025

SD$^2$: Self-Distilled Sparse Drafters
Mike Lasby, Nish Sinnadurai, Valavan Manohararajah, Sean Lie, Yani Andrew Ioannou, Vithursan Thangarasa
10 Apr 2025

SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri
01 Apr 2025

WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
Youhui Zuo, Sibo Wei, C. Zhang, Zhuorui Liu, Dawei Song
23 Mar 2025

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, ..., Andrew Wen, Shaochen Zhong, Hanjie Chen, Helen Zhou
20 Mar 2025

A Survey on Transformer Context Extension: Approaches and Evaluation
Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu
17 Mar 2025

X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum
14 Mar 2025

APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs. Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
17 Feb 2025

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
12 Feb 2025

TOPLOC: A Locality Sensitive Hashing Scheme for Trustless Verifiable Inference
Jack Min Ong, Matthew Di Ferrante, Aaron Pazdera, Ryan Garner, Sami Jaghouar, Manveer Basra, Max Ryabinin, Johannes Hagemann
27 Jan 2025

Taming Teacher Forcing for Masked Autoregressive Video Generation. Computer Vision and Pattern Recognition (CVPR), 2025.
Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, ..., Zheng Ge, Nan Duan, Xiangyu Zhang, L. Ni, H. Shum
21 Jan 2025

MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
Wenxuan Zeng, Ye Dong, Jinjin Zhou, Jin Tan, Tao Wei, Runsheng Wang, Meng Li
12 Jan 2025

Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format. International Symposium on High-Performance Computer Architecture (HPCA), 2024.
Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst
24 Nov 2024

An Evolved Universal Transformer Memory. International Conference on Learning Representations (ICLR), 2024.
Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang
17 Oct 2024

MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection. International Conference on Learning Representations (ICLR), 2024.
Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng
16 Oct 2024

Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu
02 Oct 2024

An overview of domain-specific foundation model: key technologies, applications and challenges. Science China Information Sciences (Sci. China Inf. Sci.), 2024.
Haolong Chen, Hanzhi Chen, Zijian Zhao, Kaifeng Han, Guangxu Zhu, Yichen Zhao, Ying Du, Wei Xu, Qingjiang Shi
06 Sep 2024

Multi-Turn Interactions for Text-to-SQL with Large Language Models
Guanming Xiong, Junwei Bao, Hongfei Jiang, Yang Song, Wen Zhao
09 Aug 2024

ThinK: Thinner Key Cache by Query-Driven Pruning. International Conference on Learning Representations (ICLR), 2024.
Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo
30 Jul 2024

Yi: Open Foundation Models by 01.AI
01.AI: Alex Young, Bei Chen, Chao Li, ..., Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
07 Mar 2024

Fast Transformer Decoding: One Write-Head is All You Need
Noam M. Shazeer
06 Nov 2019