QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
arXiv:2502.10424, 5 February 2025
Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Papers citing "QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache" (50 of 53 shown)

Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models
Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
05 Oct 2025

Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li
19 Sep 2025

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Aditya Tomar, Coleman Hooper, M. Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
14 Aug 2025

SPECS: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer, Ion Stoica, Kannan Ramchandran, Ahmad Beirami, Ziteng Sun
15 Jun 2025

Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu
28 May 2025

ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
Evangelos Georganas, Dhiraj D. Kalamkar, Alexander Kozlov, Alexander Heinecke
17 Mar 2025

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, ..., Jianchao Tan, Chong Chen, Fan Yang, Yue Yang, Lili Qiu
03 Jan 2025

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
International Conference on Learning Representations (ICLR), 2025
Haocheng Xi, Han Cai, Ligeng Zhu, Yaojie Lu, Kurt Keutzer, Jianfei Chen, Song Han
25 Oct 2024

QSpec: Speculative Decoding with Complementary Quantization Schemes
Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu
15 Oct 2024

KV Prediction for Improved Time to First Token
Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi
10 Oct 2024

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
International Conference on Learning Representations (ICLR), 2025
Jintao Zhang, Jia Wei, Pengle Zhang, Jun-Jie Zhu, Jun Zhu, Jianfei Chen
03 Oct 2024

HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly
Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, Danqi Chen
03 Oct 2024

INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang
25 Sep 2024

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
International Conference on Learning Representations (ICLR), 2025
Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, Beidi Chen
20 Aug 2024

Post-Training Sparse Attention with Double Sparsity
Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng
11 Aug 2024

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
19 Jul 2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao
11 Jul 2024

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, ..., Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
02 Jul 2024

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han
16 Jun 2024

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
William Brandon, Mayank Mishra, Aniruddha Nrusimha, Yikang Shen, Jonathan Ragan-Kelley
21 May 2024

SirLLM: Streaming Infinite Retentive LLM
Yao Yao, Zuchao Li, Hai Zhao
21 May 2024

SnapKV: LLM Knows What You are Looking for Before Generation
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen
22 Apr 2024

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen
18 Apr 2024

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
Haocheng Xi, Yuxiang Chen, Kang Zhao, Kaijun Zheng, Jianfei Chen, Jun Zhu
19 Mar 2024

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
International Conference on Machine Learning (ICML), 2024
Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo Ponti
14 Mar 2024

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao
08 Mar 2024

∞Bench: Extending Long Context Evaluation Beyond 100K Tokens
Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, ..., Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, Maosong Sun
21 Feb 2024

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
19 Feb 2024

Speculative Streaming: Fast LLM Inference without Auxiliary Models
Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
16 Feb 2024

Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, Jae W. Lee
16 Feb 2024

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
05 Feb 2024

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
31 Jan 2024

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
International Conference on Machine Learning (ICML), 2024
Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang R. Zhang
26 Jan 2024

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao
19 Jan 2024

FP8-LM: Training FP8 Large Language Models
Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, ..., Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Jun Zhou
27 Oct 2023

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
International Conference on Learning Representations (ICLR), 2024
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao
03 Oct 2023

Efficient Streaming Language Models with Attention Sinks
International Conference on Learning Representations (ICLR), 2024
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
29 Sep 2023

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
International Conference on Learning Representations (ICLR), 2024
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqiang Li, Kaipeng Zhang, Shiyang Feng, Yu Qiao, Ping Luo
25 Aug 2023

QuIP: 2-Bit Quantization of Large Language Models With Guarantees
Neural Information Processing Systems (NeurIPS), 2023
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Chris De Sa
25 Jul 2023

H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Neural Information Processing Systems (NeurIPS), 2023
Zhenyu Zhang, Ying Sheng, Wanrong Zhu, Tianlong Chen, Lianmin Zheng, ..., Yuandong Tian, Christopher Ré, Clark W. Barrett, Zinan Lin, Beidi Chen
24 Jun 2023

SqueezeLLM: Dense-and-Sparse Quantization
International Conference on Machine Learning (ICML), 2024
Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer
13 Jun 2023

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Conference on Machine Learning and Systems (MLSys), 2024
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han
01 Jun 2023

Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
Neural Information Processing Systems (NeurIPS), 2023
Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava
26 May 2023

Full Stack Optimization of Transformer Inference: a Survey
Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, ..., Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami
27 Feb 2023

Speculative Decoding with Big Little Decoder
Neural Information Processing Systems (NeurIPS), 2023
Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer
15 Feb 2023

Accelerating Large Language Model Decoding with Speculative Sampling
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper
02 Feb 2023

Fast Inference from Transformers via Speculative Decoding
International Conference on Machine Learning (ICML), 2023
Yaniv Leviathan, Matan Kalman, Yossi Matias
30 Nov 2022

Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities
Neural Information Processing Systems (NeurIPS), 2022
Zejiang Shen, Kyle Lo, L. Yu, N. Dahlberg, Margo Schlanger, Doug Downey
22 Jun 2022

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Neural Information Processing Systems (NeurIPS), 2022
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
27 May 2022

Transformer Acceleration with Dynamic Sparse Attention
Liu Liu, Zheng Qu, Zhaodong Chen, Yufei Ding, Yuan Xie
21 Oct 2021