KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches

arXiv:2407.01527 · 1 July 2024

Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, V. Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu

Papers citing "KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches"

19 citing papers shown.

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, E. Ponti
24 Apr 2025

LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important
Manlai Liang, JiaMing Zhang, Xiong Li, Jinlong Li
07 Apr 2025 · MQ

SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs
Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han
20 Mar 2025

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, ..., Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen
20 Mar 2025 · OffRL, ReLM, LRM

Cost-Optimal Grouped-Query Attention for Long-Context LLMs
Y. Chen, Yutong Wu, Xu Han, Zhiyuan Liu, Maosong Sun
12 Mar 2025

Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs
Ravi Ghadia, Avinash Kumar, Gaurav Jain, Prashant J. Nair, Poulami Das
02 Mar 2025

The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?
Zhenheng Tang, Xiang Liu, Qian Wang, Peijie Dong, Bingsheng He, Xiaowen Chu, Bo Li
24 Feb 2025 · LRM

Can LLMs Maintain Fundamental Abilities under KV Cache Compression?
Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu
04 Feb 2025

Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst
24 Nov 2024

Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu
02 Oct 2024 · RALM

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency
Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chang D. Yoo
23 Jul 2024

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
A. Zandieh, Majid Daliri, Insu Han
05 Jun 2024 · MQ

SnapKV: LLM Knows What You are Looking for Before Generation
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr F. Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
22 Apr 2024 · VLM

HGRN2: Gated Linear RNNs with State Expansion
Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong
11 Apr 2024 · LRM

QAQ: Quality Adaptive Quantization for LLM KV Cache
Shichen Dong, Wenfang Cheng, Jiayu Qin, Wei Wang
07 Mar 2024 · MQ

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George-Christian Muraru, ..., David Budden, Yee Whye Teh, Razvan Pascanu, Nando de Freitas, Çağlar Gülçehre
29 Feb 2024 · Mamba

No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
J. Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, S. Kwon, Dongsoo Lee
28 Feb 2024 · MQ

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
05 Feb 2024 · MQ

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
13 Mar 2023