QSpec: Speculative Decoding with Complementary Quantization Schemes

15 October 2024

Lingpeng Kong

Abstract

Quantization has been substantially adopted to accelerate inference and reduce memory consumption of large language models (LLMs). While activation-weight joint quantization speeds up the inference process through low-precision kernels, we demonstrate that it suffers severe performance degradation on multi-step reasoning tasks, rendering it ineffective. We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding. Leveraging nearly cost-free execution switching, QSPEC drafts tokens with low-precision, fast activation-weight quantization, and verifies them with high-precision weight-only quantization, effectively combining the strengths of both quantization schemes. Compared to high-precision quantization methods, QSPEC empirically boosts token generation throughput by up to 1.64x without any quality compromise, distinguishing it from other low-precision quantization approaches. This enhancement is also consistent across various serving tasks, model sizes, quantization methods, and batch sizes. Compared to state-of-art speculative decoding methods, our approach reuses weights and the KV cache, avoiding extra memory overhead while achieving up to 1.55x speedup in batched serving with a high acceptance rate. Furthermore, QSPEC offers a plug-and-play advantage without requiring any training. We believe that QSPEC demonstrates unique strengths for future deployment of high-fidelity quantization schemes, particularly in memory-constrained scenarios (e.g., edge devices).

View on arXiv

@article{zhao2025_2410.11305,
  title={ QSpec: Speculative Decoding with Complementary Quantization Schemes },
  author={ Juntao Zhao and Wenhao Lu and Sheng Wang and Lingpeng Kong and Chuan Wu },
  journal={arXiv preprint arXiv:2410.11305},
  year={ 2025 }
}

Comments on this paper