QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
arXiv:2502.10424, 5 February 2025
Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Papers citing "QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache" (50 of 53 shown)

Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models
Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
05 Oct 2025

Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li
19 Sep 2025

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Aditya Tomar, Coleman Hooper, M. Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
14 Aug 2025

SPECS: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer, Ion Stoica, Kannan Ramchandran, Ahmad Beirami, Ziteng Sun
15 Jun 2025

Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu
28 May 2025

ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
Evangelos Georganas, Dhiraj D. Kalamkar, Alexander Kozlov, Alexander Heinecke
17 Mar 2025

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, ..., Jianchao Tan, Chong Chen, Fan Yang, Yue Yang, Lili Qiu
03 Jan 2025

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
International Conference on Learning Representations (ICLR), 2025
Haocheng Xi, Han Cai, Ligeng Zhu, Yaojie Lu, Kurt Keutzer, Jianfei Chen, Song Han
25 Oct 2024

QSpec: Speculative Decoding with Complementary Quantization Schemes
Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu
15 Oct 2024

KV Prediction for Improved Time to First Token
Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi
10 Oct 2024

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
International Conference on Learning Representations (ICLR), 2025
Jintao Zhang, Jia Wei, Pengle Zhang, Jun-Jie Zhu, Jun Zhu, Jianfei Chen
03 Oct 2024

HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly
Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, Danqi Chen
03 Oct 2024

INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang
25 Sep 2024

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
International Conference on Learning Representations (ICLR), 2025
Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, Beidi Chen
20 Aug 2024

Post-Training Sparse Attention with Double Sparsity
Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng
11 Aug 2024

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
19 Jul 2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao
11 Jul 2024

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, ..., Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
02 Jul 2024

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han
16 Jun 2024

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
William Brandon, Mayank Mishra, Aniruddha Nrusimha, Yikang Shen, Jonathan Ragan-Kelley
21 May 2024

SirLLM: Streaming Infinite Retentive LLM
Yao Yao, Zuchao Li, Hai Zhao
21 May 2024

SnapKV: LLM Knows What You are Looking for Before Generation
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
22 Apr 2024

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen
18 Apr 2024

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
Haocheng Xi, Yuxiang Chen, Kang Zhao, Kaijun Zheng, Jianfei Chen, Jun Zhu
19 Mar 2024

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
International Conference on Machine Learning (ICML), 2024
Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo Ponti
14 Mar 2024

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao
08 Mar 2024

∞Bench: Extending Long Context Evaluation Beyond 100K Tokens
Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, ..., Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, Maosong Sun
21 Feb 2024

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
19 Feb 2024

Speculative Streaming: Fast LLM Inference without Auxiliary Models
Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
16 Feb 2024

Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, Jae W. Lee
16 Feb 2024

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
05 Feb 2024

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
31 Jan 2024

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
International Conference on Machine Learning (ICML), 2024
Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang
26 Jan 2024

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao
19 Jan 2024

FP8-LM: Training FP8 Large Language Models
Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, ..., Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Jun Zhou
27 Oct 2023

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
International Conference on Learning Representations (ICLR), 2024
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao
03 Oct 2023

Efficient Streaming Language Models with Attention Sinks
International Conference on Learning Representations (ICLR), 2024
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
29 Sep 2023

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
International Conference on Learning Representations (ICLR), 2024
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqiang Li, Kaipeng Zhang, Shiyang Feng, Yu Qiao, Ping Luo
25 Aug 2023

QuIP: 2-Bit Quantization of Large Language Models With Guarantees
Neural Information Processing Systems (NeurIPS), 2023
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Chris De Sa
25 Jul 2023

H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Neural Information Processing Systems (NeurIPS), 2023
Zhenyu Zhang, Ying Sheng, Wanrong Zhu, Tianlong Chen, Lianmin Zheng, ..., Yuandong Tian, Christopher Ré, Clark W. Barrett, Zinan Lin, Beidi Chen
24 Jun 2023

SqueezeLLM: Dense-and-Sparse Quantization
International Conference on Machine Learning (ICML), 2024
Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer
13 Jun 2023

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Conference on Machine Learning and Systems (MLSys), 2024
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han
01 Jun 2023

Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
Neural Information Processing Systems (NeurIPS), 2023
Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava
26 May 2023

Full Stack Optimization of Transformer Inference: a Survey
Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, ..., Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami
27 Feb 2023

Speculative Decoding with Big Little Decoder
Neural Information Processing Systems (NeurIPS), 2023
Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer
15 Feb 2023

Accelerating Large Language Model Decoding with Speculative Sampling
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper
02 Feb 2023

Fast Inference from Transformers via Speculative Decoding
International Conference on Machine Learning (ICML), 2023
Yaniv Leviathan, Matan Kalman, Yossi Matias
30 Nov 2022

Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities
Neural Information Processing Systems (NeurIPS), 2022
Zejiang Shen, Kyle Lo, L. Yu, N. Dahlberg, Margo Schlanger, Doug Downey
22 Jun 2022

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Neural Information Processing Systems (NeurIPS), 2022
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
27 May 2022

Transformer Acceleration with Dynamic Sparse Attention
Liu Liu, Zheng Qu, Zhaodong Chen, Yufei Ding, Yuan Xie
21 Oct 2021