Scaling Laws for Speculative Decoding

8 May 2025
Siyuan Yan, Mo Zhu, Guo-qing Jiang, Jianfei Wang, Jiaxing Chen, Wentai Zhang, Xiang Liao, Xiao Cui, Chen Zhang, Zhuoran Song, Ran Zhu
Abstract

The escalating demand for efficient decoding in large language models (LLMs) is particularly critical for reasoning-intensive architectures such as OpenAI-o3 and DeepSeek-R1, which depend on extended chain-of-thought reasoning. This study investigates speculative decoding across dense LLM architectures to establish foundational insights for accelerating reasoning tasks. While speculative decoding methods that leverage parallel draft-verification cycles have emerged as promising acceleration techniques, the scaling laws governing their decoding efficiency remain under-explored compared with those of conventional backbone LLMs developed through the Pretraining → SFT → RLHF training paradigm. In this work, we discover Log-linear Scaling Laws (Theorems 1.1, 1.2, and 1.3) governing the draft model's acceptance rate (and hence decoding speed) across three dimensions: pretraining token volume, draft model capacity, and decoding batch size. Building on these laws, we develop Scylla, which coordinates multi-dimensional scaling for popular LLMs (Llama2/3, Qwen2.5). Empirical validation shows that Scylla achieves a 1.5-2.2 higher acceptance rate than EAGLE2 and a 0.3 higher acceptance rate than EAGLE3 at temperature T = 0, with peak performance gains on summarization and QA tasks (Figure 2). Deployments in an industrial inference engine demonstrate a 2X improvement in decoding throughput over EAGLE2 (Table 5), validating the transformative potential of systematic scaling for efficient LLM inference. Code will be released later.
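
For context on the draft-verification cycle mentioned above, the following is a minimal sketch of one greedy (temperature T = 0) speculative-decoding step. It is not the paper's Scylla implementation; the draft_model and target_model callables, their [batch, length, vocab] logits interface, and the batch-size-1 simplification are assumptions for illustration only.

import torch

def speculative_step(draft_model, target_model, prefix, k=4):
    """One greedy draft-verify cycle: propose k tokens with the small draft
    model, then check them against the large target model in a single
    parallel forward pass. For simplicity this sketch assumes batch size 1."""
    # 1) Draft phase: the draft model proposes k tokens autoregressively.
    ctx = prefix
    draft_tokens = []
    for _ in range(k):
        logits = draft_model(ctx)[:, -1, :]          # logits for the next token
        tok = logits.argmax(dim=-1, keepdim=True)    # greedy pick (T = 0)
        draft_tokens.append(tok)
        ctx = torch.cat([ctx, tok], dim=-1)

    # 2) Verification phase: one target-model pass scores all k proposals.
    target_logits = target_model(ctx)[:, -k - 1:-1, :]

    # 3) Accept proposals until the first disagreement; at the mismatch,
    #    keep the target model's own token and stop.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_tok = target_logits[:, i, :].argmax(dim=-1, keepdim=True)
        if torch.equal(target_tok, tok):
            accepted.append(tok)
        else:
            accepted.append(target_tok)
            break
    return torch.cat([prefix] + accepted, dim=-1)

The fraction of drafted tokens the target model accepts per cycle is the acceptance rate the abstract's scaling laws describe; a higher acceptance rate means fewer target-model passes per generated token and therefore higher throughput.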

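As a purely illustrative reading of what "log-linear" means here (the precise statements, constants, and conditions are given by Theorems 1.1-1.3 in the paper and are not reproduced in this abstract), each scaling dimension x, whether pretraining token volume, draft model capacity, or decoding batch size, is modeled as relating to the draft acceptance rate roughly as

    \alpha(x) \approx a \log x + b

with dimension-specific coefficients a and b fitted empirically. Coordinating the scaling of all three dimensions at once is what Scylla's multi-dimensional approach exploits.
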
@article{yan2025_2505.07858,
  title={Scaling Laws for Speculative Decoding},
  author={Siyuan Yan and Mo Zhu and Guo-qing Jiang and Jianfei Wang and Jiaxing Chen and Wentai Zhang and Xiang Liao and Xiao Cui and Chen Zhang and Zhuoran Song and Ran Zhu},
  journal={arXiv preprint arXiv:2505.07858},
  year={2025}
}