arXiv:2407.18003
Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
25 July 2024
Luohe Shi
Hongyi Zhang
Yao Yao
Zuchao Li
Hai Zhao
Papers citing "Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption" (38 papers shown)
Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management
Xinjun Yang
Qingda Hu
Junru Li
Feifei Li
Yicong Zhu
...
Jian Dai
Yang Kong
J. Zhang
Guoqiang Xu
Qiang Liu
99
1
0
25 Nov 2025
XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
Haoqi Yang
Yao Yao
Zuchao Li
Baoyuan Qi
Guoming Liu
Hai Zhao
MQ
127
1
0
13 Oct 2025
LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences
Wenbo Wu
Qingyi Si
Xiurui Pan
Y. Wang
Jie Zhang
VLM
103
0
0
13 Oct 2025
UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Shian Du
Menghan Xia
Chang-rui Liu
Quande Liu
Xintao Wang
Pengfei Wan
Xiangyang Ji
VGen
SupR
275
0
0
09 Oct 2025
The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
Alexander Fichtl
Jeremias Bohn
Josefin Kelber
Edoardo Mosca
Georg Groh
132
0
0
06 Oct 2025
Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Alessio Devoto
Maximilian Jeblick
Simon Jégou
MQ
VLM
108
4
0
01 Oct 2025
OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule
Yuxuan Zhu
David H. Yang
Mohammad Mohammadi Amiri
K. Murugesan
Tejaswini Pedapati
Pin-Yu Chen
VLM
175
0
0
25 Sep 2025
Efficient Uncertainty Estimation for LLM-based Entity Linking in Tabular Data
Carlo Bono
Federico Belotti
Matteo Palmonari
130
0
0
24 Sep 2025
Attention Beyond Neighborhoods: Reviving Transformer for Graph Clustering
Xuanting Xie
Bingheng Li
Erlin Pan
Rui Hou
Wenyu Chen
Zhao Kang
GNN
212
0
0
18 Sep 2025
A Comprehensive Review of Reinforcement Learning for Autonomous Driving in the CARLA Simulator
Elahe Delavari
Feeza Khan Khanzada
Jaerock Kwon
145
3
0
10 Sep 2025
Adaptive KV-Cache Compression without Manually Setting Budget
Chenxia Tang
Jianchun Liu
Hongli Xu
Liusheng Huang
113
0
0
03 Sep 2025
CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
Zicong Tang
Ziyang Ma
Suqing Wang
Zuchao Li
Lefei Zhang
Hai Zhao
Yun Li
Qianren Wang
VLM
139
2
0
24 Aug 2025
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
Yixuan Wang
Haoyu Qiao
Lujun Li
Qingfu Zhu
Wanxiang Che
MQ
134
1
0
22 Aug 2025
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
Huanxuan Liao
Yixing Xu
Shizhu He
Guanchen Li
Xuanwu Yin
Dong Li
E. Barsoum
Jun Zhao
Kang Liu
157
1
0
21 Aug 2025
TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
Zhekai Chen
Ruihang Chu
Yukang Chen
Shiwei Zhang
Yujie Wei
Yingya Zhang
Xihui Liu
260
8
0
24 Jul 2025
MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Kunxi Li
Zhonghua Jiang
Zhouzhou Shen
Zhaode Wang
Chengfei Lv
Shengyu Zhang
Fan Wu
Fei Wu
VLM
206
2
0
06 Jun 2025
KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache
Fei Li
Song Liu
Weiguo Wu
Shiqiang Nie
Jinyu Wang
MQ
95
0
0
18 May 2025
KeepKV: Achieving Periodic Lossless KV Cache Compression for Efficient LLM Inference
Yuxuan Tian
Zihan Wang
Yebo Peng
Aomufei Yuan
Zhaoxiang Wang
Bairen Yi
Xin Liu
Yong Cui
Tong Yang
374
0
0
14 Apr 2025
SD²: Self-Distilled Sparse Drafters
Mike Lasby
Nish Sinnadurai
Valavan Manohararajah
Sean Lie
Yani Andrew Ioannou
Vithursan Thangarasa
791
1
0
10 Apr 2025
SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching
Yuxuan Zhu
Ali Falahati
David H. Yang
Mohammad Mohammadi Amiri
318
1
0
01 Apr 2025
WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
Youhui Zuo
Sibo Wei
C. Zhang
Zhuorui Liu
Dawei Song
VLM
421
1
0
23 Mar 2025
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Yang Sui
Yu-Neng Chuang
Guanchu Wang
Jiamu Zhang
Tianyi Zhang
...
Andrew Wen
Shaochen Zhong
Hanjie Chen
Helen Zhou
OffRL
ReLM
LRM
758
273
0
20 Mar 2025
A Survey on Transformer Context Extension: Approaches and Evaluation
Yijun Liu
Jinzheng Yu
Yang Xu
Zhongyang Li
Qingfu Zhu
LLMAG
520
12
0
17 Mar 2025
X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
Guihong Li
Mehdi Rezagholizadeh
Mingyu Yang
Vikram Appia
Emad Barsoum
VLM
381
1
0
14 Mar 2025
APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yuxiang Huang
Mingye Li
Xu Han
Chaojun Xiao
Weilin Zhao
Sun Ao
Hao Zhou
Jie Zhou
Zhiyuan Liu
Maosong Sun
386
2
0
17 Feb 2025
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Konstantin Berestizshevsky
Renzo Andri
Lukas Cavigelli
421
2
0
12 Feb 2025
TOPLOC: A Locality Sensitive Hashing Scheme for Trustless Verifiable Inference
Jack Min Ong
Matthew Di Ferrante
Aaron Pazdera
Ryan Garner
Sami Jaghouar
Manveer Basra
Max Ryabinin
Johannes Hagemann
LRM
348
7
0
27 Jan 2025
Taming Teacher Forcing for Masked Autoregressive Video Generation
Computer Vision and Pattern Recognition (CVPR), 2025
Deyu Zhou
Quan Sun
Yuang Peng
Kun Yan
Runpei Dong
...
Zheng Ge
Nan Duan
Xiangyu Zhang
L. Ni
H. Shum
VGen
389
19
0
21 Jan 2025
MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
Wenxuan Zeng
Ye Dong
Jinjin Zhou
Jin Tan
Tao Wei
Runsheng Wang
Meng Li
339
1
0
12 Jan 2025
Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Chao Fang
Man Shi
Robin Geens
Arne Symons
Zhongfeng Wang
Marian Verhelst
417
11
0
24 Nov 2024
An Evolved Universal Transformer Memory
International Conference on Learning Representations (ICLR), 2025
Edoardo Cetin
Qi Sun
Tianyu Zhao
Yujin Tang
1.3K
4
0
17 Oct 2024
MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
International Conference on Learning Representations (ICLR), 2025
Bokai Lin
Zihao Zeng
Zipeng Xiao
Siqi Kou
Tianqi Hou
Xiaofeng Gao
Hao Zhang
Zhijie Deng
304
10
0
16 Oct 2024
Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Yuxiang Huang
Binhang Yuan
Xu Han
Chaojun Xiao
Zhiyuan Liu
RALM
472
11
0
02 Oct 2024
An overview of domain-specific foundation model: key technologies, applications and challenges
Science China Information Sciences (Sci. China Inf. Sci.), 2024
Haolong Chen
Hanzhi Chen
Zijian Zhao
Kaifeng Han
Guangxu Zhu
Yichen Zhao
Ying Du
Wei Xu
Qingjiang Shi
ALM
VLM
489
19
0
06 Sep 2024
Multi-Turn Interactions for Text-to-SQL with Large Language Models
Guanming Xiong
Junwei Bao
Hongfei Jiang
Yang Song
Wen Zhao
LRM
370
2
0
09 Aug 2024
ThinK: Thinner Key Cache by Query-Driven Pruning
International Conference on Learning Representations (ICLR), 2025
Yuhui Xu
Zhanming Jie
Hanze Dong
Lei Wang
Xudong Lu
Aojun Zhou
Amrita Saha
Caiming Xiong
Doyen Sahoo
533
41
0
30 Jul 2024
Yi: Open Foundation Models by 01.AI
01.AI
Alex Young
Bei Chen
Chao Li
...
Yue Wang
Yuxuan Cai
Zhenyu Gu
Zhiyuan Liu
Zonghong Dai
OSLM
LRM
840
768
0
07 Mar 2024
Fast Transformer Decoding: One Write-Head is All You Need
Noam M. Shazeer
599
641
0
06 Nov 2019