arXiv: 2408.11049
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

International Conference on Learning Representations (ICLR), 2024
20 August 2024
Jian Chen
Vashisth Tiwari
Ranajoy Sadhukhan
Zhuoming Chen
Jinyuan Shi
Ian En-Hsu Yen
Avner May
Tianqi Chen
Beidi Chen
    LRM
ArXiv (abs) · PDF · HTML · HuggingFace (13 upvotes) · GitHub (5233★)

Papers citing "MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding"

50 / 71 papers shown
SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification
Zhendong Tan
Xingjun Zhang
Chaoyi Hu
Junjie Peng
Kun Xia
LRM
182
0
0
02 Dec 2025
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
Yilong Zhao
Jiaming Tang
Kan Zhu
Zihao Ye
Chi-chih Chang
...
Mohamed S. Abdelfattah
Mingyu Gao
Baris Kasikci
Song Han
Ion Stoica
ReLM, LRM
269
1
0
01 Dec 2025
Polybasic Speculative Decoding Through a Theoretical Perspective
Ruilin Wang
Huixia Li
Yuexiao Ma
Xiawu Zheng
Fei Chao
Xuefeng Xiao
Rongrong Ji
278
0
0
30 Oct 2025
CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
Zhiyuan Ning
Jiawei Shao
Ruge Xu
Xinfei Guo
Jun Zhang
Chi Zhang
Xuelong Li
177
0
0
30 Oct 2025
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
Sibo Xiao
Jinyuan Fu
Zhongle Xie
Lidan Shou
AI4TS
238
0
0
17 Oct 2025
DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
Jinbin Zhang
Nasib Ullah
Erik Schultheis
Rohit Babbar
212
1
0
11 Oct 2025
Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
Ruanjun Li
Ziheng Liu
Yuanming Shi
Jiawei Shao
Chi Zhang
Xuelong Li
198
0
0
19 Sep 2025
SpecVLM: Fast Speculative Decoding in Vision-Language Models
Haiduo Huang
Fuwei Yang
Zhenhua Liu
Xuanwu Yin
Dong Li
Pengju Ren
E. Barsoum
MLLM, VLM
294
3
0
15 Sep 2025
LongCat-Flash Technical Report
M-A-P Team
Bayan
Bei Li
Bingye Lei
Bo Wang
...
Rongxiang Weng
Ruichen Shao
Rumei Li
Shizhe Wu
Shuai Liang
MLLM, MoE, VLM
532
33
0
01 Sep 2025
ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
Hao Wen
Yifan Su
Feifei Zhang
Yunxin Liu
Yunhao Liu
Y. Zhang
Yuanchun Li
ReLM, LRM
226
26
0
30 Aug 2025
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji
Jun Zhang
Heming Xia
Jinpeng Chen
Lidan Shou
Gang Chen
Huan Li
VLM
298
13
0
22 Aug 2025
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Aditya Tomar
Coleman Hooper
M Lee
Haocheng Xi
Rishabh Tiwari
Wonjun Kang
Luca Manolache
Michael W. Mahoney
Kurt Keutzer
A. Gholami
MQ
289
2
0
14 Aug 2025
READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Maxim Divilkovskiy
Vitaly Malygin
Sergey Zlobin
Sultan Isali
Vasily Kalugin
Stanislav Ilyushin
Nuriza Aitassova
Yi Fei
Zeng Weidi
RALM
237
0
0
12 Aug 2025
OverFill: Two-Stage Models for Efficient Language Model Decoding
Woojeong Kim
Junxiong Wang
Jing Nathan Yan
Mohamed S. Abdelfattah
Alexander M Rush
148
0
0
11 Aug 2025
Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
Bangsheng Tang
Carl Chengyan Fu
Fei Kou
Grigory Sizov
Haoci Zhang
...
Vlad Mihailescu
Xingwen Guo
Yan Cui
Y. Hu
Yejin Lee
LRM
388
6
0
11 Aug 2025
R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning
Zhuokun Chen
Zeren Chen
Jiahao He
Lu Sheng
Zhuliang Yu
Jianfei Cai
Bohan Zhuang
LRM
518
4
0
23 Jul 2025
OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
R. Ramakrishnan
Zhaocong Yuan
Shaojie Zhuo
Chen Feng
Yicheng Lin
Chenzheng Su
Xiaopeng Zhang
SyDa
425
1
0
03 Jul 2025
Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?
Adithya Bhaskar
Alexander Wettig
Tianyu Gao
Yihe Dong
Danqi Chen
307
9
0
20 Jun 2025
Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan
Zhuoming Chen
Haizhong Zheng
Yang Zhou
Emma Strubell
Beidi Chen
499
9
0
05 Jun 2025
Rectified Sparse Attention
Yutao Sun
Tianzhu Ye
Li Dong
Yuqing Xia
Jian Chen
Yizhao Gao
S. Cao
Jianyong Wang
Furu Wei
343
7
0
04 Jun 2025
Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Yudi Zhang
Weilin Zhao
Xu Han
Tiejun Zhao
Wang Xu
Hailong Cao
Conghui Zhu
MQ
427
2
0
28 May 2025
SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
Jungyoub Cha
Hyunjong Kim
Sungzoon Cho
VLM
439
1
0
27 May 2025
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xuan Zhang
Cunxiao Du
Sicheng Yu
Jiawei Wu
Fengzhuo Zhang
Wei Gao
Qian Liu
297
2
0
25 May 2025
Automatic Task Detection and Heterogeneous LLM Speculative Decoding
Danying Ge
Jianhua Gao
Qizhi Jiang
Yifei Feng
Weixing Ji
282
0
0
13 May 2025
Energy Considerations of Large Language Model Inference and Efficiency Optimizations
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jared Fernandez
Clara Na
Vashisth Tiwari
Yonatan Bisk
Sasha Luccioni
Emma Strubell
661
49
0
24 Apr 2025
SD$^2$: Self-Distilled Sparse Drafters
Mike Lasby
Nish Sinnadurai
Valavan Manohararajah
Sean Lie
Yani Andrew Ioannou
Vithursan Thangarasa
851
1
0
10 Apr 2025
SPIRe: Boosting LLM Inference Throughput with Speculative Decoding
Sanjit Neelam
Daniel Heinlein
Vaclav Cvicek
Akshay Mishra
Reiner Pope
LRM
206
0
0
08 Apr 2025
Cognitive Memory in Large Language Models
Lianlei Shan
Shixian Luo
Zezhou Zhu
Yu Yuan
Yong Wu
LLMAG, KELM
1.3K
28
0
03 Apr 2025
ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
E. Georganas
Dhiraj D. Kalamkar
Alexander Kozlov
A. Heinecke
MQ
1.0K
6
0
17 Mar 2025
AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving
Kaiyu Huang
Yu Wang
Zhubo Shi
Han Zou
Minchen Yu
Qingjiang Shi
LRM
391
10
0
07 Mar 2025
Speculative Decoding and Beyond: An In-Depth Survey of Techniques
Y. Hu
Zining Liu
Zhenyuan Dong
Tianfan Peng
Bradley McDanel
Shanghang Zhang
973
0
0
27 Feb 2025
RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
Guanzheng Chen
Qilong Feng
Jinjie Ni
Xin Li
Michael Shieh
RALM
523
8
0
27 Feb 2025
TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation
Tong Wu
Junzhe Shen
Zixia Jia
Yanjie Wang
Zilong Zheng
382
1
0
26 Feb 2025
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
Penghui Yang
Cunxiao Du
Fengzhuo Zhang
Haonan Wang
Tianyu Pang
Chao Du
Bo An
RALM, MQ
385
2
0
24 Feb 2025
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Rishabh Tiwari
Haocheng Xi
Aditya Tomar
Coleman Hooper
Sehoon Kim
Maxwell Horton
Mahyar Najibi
Michael W. Mahoney
Kemal Kurniawan
Amir Gholami
MQ
390
13
0
05 Feb 2025
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Nadav Timor
Jonathan Mamou
Daniel Korat
Moshe Berchansky
Oren Pereg
Gaurav Jain
Roy Schwartz
Moshe Wasserblat
777
10
0
31 Jan 2025
AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding
Zikun Li
Zhuofu Chen
Yingyi Huang
Xupeng Miao
Zeyu Wang
...
Zhuoming Chen
Sean Lai
Xinhao Cheng
Xupeng Miao
Zhihao Jia
413
6
0
21 Jan 2025
Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding
Hyun Ryu
Eric Kim
429
5
0
20 Nov 2024
SSSD: Simply-Scalable Speculative Decoding
Michele Marzollo
Jiawei Zhuang
Niklas Roemer
Lorenz K. Müller
Lukas Cavigelli
LRM
488
2
0
08 Nov 2024
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
Zilin Xiao
Hongming Zhang
Tao Ge
Siru Ouyang
Vicente Ordonez
Dong Yu
351
17
0
08 Oct 2024
No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha
A. Agrawal
Haoran Qiu
Junda Chen
Íñigo Goiri
Chaojie Zhang
Rayyan Shahid
Ramachandran Ramjee
Alexey Tumanov
Esha Choukse
RALM, LRM
694
0
0
25 Sep 2024
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang
Yucheng Li
Chengruidong Zhang
Qianhui Wu
Xufang Luo
...
Amir H. Abdi
Dongsheng Li
Chin-Yew Lin
Yuqing Yang
L. Qiu
442
307
0
02 Jul 2024
TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
Xiaoxuan Liu
Cade Daniel
Langxiang Hu
Woosuk Kwon
Zhuohan Li
...
Kaichao You
Alvin Cheung
Zhijie Deng
Ion Stoica
Hao Zhang
560
23
0
20 Jun 2024
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Jiaming Tang
Yilong Zhao
Kan Zhu
Guangxuan Xiao
Baris Kasikci
Song Han
489
296
0
16 Jun 2024
Loki: Low-Rank Keys for Efficient Sparse Attention
Prajwal Singhania
Siddharth Singh
Shwai He
Soheil Feizi
A. Bhatele
343
66
0
04 Jun 2024
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Kaixuan Huang
Xudong Guo
M. Y. Wang
608
50
0
30 May 2024
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference
International Conference on Learning Representations (ICLR), 2024
Nadav Timor
Jonathan Mamou
Daniel Korat
Moshe Berchansky
Oren Pereg
Moshe Wasserblat
Tomer Galanti
Michal Gordon
David Harel
LRM
328
1
0
23 May 2024
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Ramya Prabhu
Ajay Nayak
Jayashree Mohan
Ramachandran Ramjee
Ashish Panwar
VLM
521
90
0
07 May 2024
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hanshi Sun
Zhuoming Chen
Xinyu Yang
Yuandong Tian
Beidi Chen
431
97
0
18 Apr 2024
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal
Nitin Kedia
Ashish Panwar
Jayashree Mohan
Nipun Kwatra
Bhargav S. Gulavani
Alexey Tumanov
Ramachandran Ramjee
558
467
0
04 Mar 2024
Page 1 of 2