WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More
Date: 19 February 2024
Authors: Yuxuan Yue, Zhihang Yuan, Haojie Duanmu, Sifan Zhou, Yue Yu, Liqiang Nie
Tags: MQ

Papers citing "WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More"

34 / 34 papers shown

Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving
Authors: Hui Zeng, Daming Zhao, Pengfei Yang, WenXuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, Jidong Zhai
Date: 08 Nov 2025

FlashEVA: Accelerating LLM inference via Efficient Attention
Authors: Juan Gabriel Kostelec, Qinghai Guo
Date: 01 Nov 2025

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
Authors: Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang
Tags: LRM
Date: 09 Oct 2025

Mitigating Diffusion Model Hallucinations with Dynamic Guidance
Authors: Kostas Triaridis, Alexandros Graikos, Aggelina Chatziagapi, Grigorios G. Chrysos, Dimitris Samaras
Tags: DiffM
Date: 06 Oct 2025

Interpreting the Effects of Quantization on LLMs
Authors: Manpreet Singh, Hassan Sajjad
Tags: MQ, MILM
Date: 22 Aug 2025

SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
Authors: Yi Zhao, Yajuan Peng, Cam-Tu Nguyen, Zuchao Li, Xiaoliang Wang, Hai Zhao, Xiaoming Fu
Date: 03 Aug 2025

ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
Authors: Xianglong Yan, Zhiteng Li, Tianao Zhang, Linghe Kong, Yulun Zhang, Yunbo Wang
Date: 30 May 2025

Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
Authors: Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang
Date: 26 May 2025

MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
Authors: Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
Tags: MQ, MoE
Date: 09 May 2025

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Authors: A. Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
Tags: MQ
Date: 28 Apr 2025

Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling
Authors: Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Danping Zou, Weiyao Lin
Tags: VLM
Date: 12 Apr 2025

SQuat: Subspace-orthogonal KV Cache Quantization
Authors: Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
Tags: MQ
Date: 31 Mar 2025

Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Authors: Keda Tao, Haoxuan You, Yang Sui, Can Qin, Haoyu Wang
Tags: VLM, MQ
Date: 20 Mar 2025

CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences
International Conference on Learning Representations (ICLR), 2025
Authors: Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li
Date: 16 Mar 2025

Binary Neural Networks for Large Language Model: A Survey
Authors: Liangdong Liu, Zhitong Zheng, Cong Wang, TianHuang Su, ZhenYu Yang
Tags: MQ
Date: 26 Feb 2025

Quantize What Counts: More for Keys, Less for Values
Authors: Mohsen Hariri, Lam Nguyen, Sixu Chen, Shaochen Zhong, Qifan Wang, Helen Zhou, Xiaotian Han, Vipin Chaudhary
Tags: MQ
Date: 20 Feb 2025

GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Authors: Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang
Tags: MQ, ALM
Date: 18 Feb 2025

CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation
AAAI Conference on Artificial Intelligence (AAAI), 2024
Authors: Hongxuan Zhang, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen
Tags: MQ
Date: 16 Dec 2024

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Authors: Hanshi Sun, Li-Wen Chang, Yiyuan Ma, Wenlei Bao, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
Tags: VLM
Date: 28 Oct 2024

AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
International Conference on Computational Linguistics (COLING), 2024
Authors: Qian Tao, Wenyuan Yu, Jingren Zhou
Tags: MQ
Date: 17 Oct 2024

AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization
Authors: Yifan Tan, Haoze Wang, Chao Yan, Yangdong Deng
Tags: MQ
Date: 25 Sep 2024

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
Authors: Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan Yan
Date: 19 Sep 2024

Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview
Authors: Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao
Tags: MQ
Date: 18 Sep 2024

Palu: Compressing KV-Cache with Low-Rank Projection
Authors: Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, N. Huang, Luis Ceze, Kai-Chiang Wu
Date: 30 Jul 2024

Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
Authors: Shi Luohe, Hongyi Zhang, Yao Yao, Z. Li, Zhao Hai
Date: 25 Jul 2024

D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models
Authors: Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, ..., Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang
Date: 18 Jun 2024

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
Authors: A. Zandieh, Majid Daliri, Insu Han
Tags: MQ
Date: 05 Jun 2024

I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models
Authors: Yan Chen, Yuan Cheng, Dawei Yang, Zhihang Yuan, Jiangyong Yu, Chen Xu, Sifan Zhou
Tags: MQ
Date: 28 May 2024

Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
Authors: Yao Fu
Date: 14 May 2024

SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
Authors: Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
Tags: MQ
Date: 10 May 2024

Efficient LLM Inference with Kcache
Authors: Qiaozhi He, Zhihua Wu
Tags: RALM
Date: 28 Apr 2024

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Authors: Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen
Date: 18 Apr 2024

LLM Inference Unveiled: Survey and Roofline Model Insights
Authors: Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, ..., Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer
Date: 26 Feb 2024

A Survey on Model Compression for Large Language Models
Transactions of the Association for Computational Linguistics (TACL), 2023
Authors: Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang
Date: 15 Aug 2023