arXiv: 2408.11049
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

International Conference on Learning Representations (ICLR), 2024
20 August 2024
Jian Chen
Vashisth Tiwari
Ranajoy Sadhukhan
Zhuoming Chen
Jinyuan Shi
Ian En-Hsu Yen
Avner May
Tianqi Chen
Beidi Chen
    LRM
ArXiv (abs) · PDF · HTML · HuggingFace (13 upvotes) · GitHub (5233★)

Papers citing "MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding"

50 / 71 papers shown
SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification
Zhendong Tan
Xingjun Zhang
Chaoyi Hu
Junjie Peng
Kun Xia
LRM
182
0
0
02 Dec 2025
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
Yilong Zhao
Jiaming Tang
Kan Zhu
Zihao Ye
Chi-chih Chang
...
Mohamed S. Abdelfattah
Mingyu Gao
Baris Kasikci
Song Han
Ion Stoica
ReLM, LRM
269
1
0
01 Dec 2025
Polybasic Speculative Decoding Through a Theoretical Perspective
Ruilin Wang
Huixia Li
Yuexiao Ma
Xiawu Zheng
Fei Chao
Xuefeng Xiao
Rongrong Ji
278
0
0
30 Oct 2025
CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
Zhiyuan Ning
Jiawei Shao
Ruge Xu
Xinfei Guo
Jun Zhang
Chi Zhang
Xuelong Li
177
0
0
30 Oct 2025
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
Sibo Xiao
Jinyuan Fu
Zhongle Xie
Lidan Shou
AI4TS
238
0
0
17 Oct 2025
DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
Jinbin Zhang
Nasib Ullah
Erik Schultheis
Rohit Babbar
212
1
0
11 Oct 2025
Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
Ruanjun Li
Ziheng Liu
Yuanming Shi
Jiawei Shao
Chi Zhang
Xuelong Li
198
0
0
19 Sep 2025
SpecVLM: Fast Speculative Decoding in Vision-Language Models
Haiduo Huang
Fuwei Yang
Zhenhua Liu
Xuanwu Yin
Dong Li
Pengju Ren
E. Barsoum
MLLM, VLM
294
3
0
15 Sep 2025
LongCat-Flash Technical Report
M-A-P Team
Bayan
Bei Li
Bingye Lei
Bo Wang
...
Rongxiang Weng
Ruichen Shao
Rumei Li
Shizhe Wu
Shuai Liang
MLLM, MoE, VLM
532
33
0
01 Sep 2025
ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
Hao Wen
Yifan Su
Feifei Zhang
Yunxin Liu
Yunhao Liu
Y. Zhang
Yuanchun Li
ReLM, LRM
226
26
0
30 Aug 2025
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji
Jun Zhang
Heming Xia
Jinpeng Chen
Lidan Shou
Gang Chen
Huan Li
VLM
298
13
0
22 Aug 2025
XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Aditya Tomar
Coleman Hooper
M Lee
Haocheng Xi
Rishabh Tiwari
Wonjun Kang
Luca Manolache
Michael W. Mahoney
Kurt Keutzer
A. Gholami
MQ
289
2
0
14 Aug 2025
READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Maxim Divilkovskiy
Vitaly Malygin
Sergey Zlobin
Sultan Isali
Vasily Kalugin
Stanislav Ilyushin
Nuriza Aitassova
Yi Fei
Zeng Weidi
RALM
237
0
0
12 Aug 2025
OverFill: Two-Stage Models for Efficient Language Model Decoding
Woojeong Kim
Junxiong Wang
Jing Nathan Yan
Mohamed S. Abdelfattah
Alexander M Rush
148
0
0
11 Aug 2025
Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
Bangsheng Tang
Carl Chengyan Fu
Fei Kou
Grigory Sizov
Haoci Zhang
...
Vlad Mihailescu
Xingwen Guo
Yan Cui
Y. Hu
Yejin Lee
LRM
388
6
0
11 Aug 2025
R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning
Zhuokun Chen
Zeren Chen
Jiahao He
Lu Sheng
Zhuliang Yu
Jianfei Cai
Bohan Zhuang
LRM
518
4
0
23 Jul 2025
OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
R. Ramakrishnan
Zhaocong Yuan
Shaojie Zhuo
Chen Feng
Yicheng Lin
Chenzheng Su
Xiaopeng Zhang
SyDa
425
1
0
03 Jul 2025
Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?
Adithya Bhaskar
Alexander Wettig
Tianyu Gao
Yihe Dong
Danqi Chen
307
9
0
20 Jun 2025
Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan
Zhuoming Chen
Haizhong Zheng
Yang Zhou
Emma Strubell
Beidi Chen
499
9
0
05 Jun 2025
Rectified Sparse Attention
Yutao Sun
Tianzhu Ye
Li Dong
Yuqing Xia
Jian Chen
Yizhao Gao
S. Cao
Jianyong Wang
Furu Wei
343
7
0
04 Jun 2025
Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Yudi Zhang
Weilin Zhao
Xu Han
Tiejun Zhao
Wang Xu
Hailong Cao
Conghui Zhu
MQ
427
2
0
28 May 2025
SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
Jungyoub Cha
Hyunjong Kim
Sungzoon Cho
VLM
439
1
0
27 May 2025
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xuan Zhang
Cunxiao Du
Sicheng Yu
Jiawei Wu
Fengzhuo Zhang
Wei Gao
Qian Liu
297
2
0
25 May 2025
Automatic Task Detection and Heterogeneous LLM Speculative Decoding
Danying Ge
Jianhua Gao
Qizhi Jiang
Yifei Feng
Weixing Ji
282
0
0
13 May 2025
Energy Considerations of Large Language Model Inference and Efficiency Optimizations
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jared Fernandez
Clara Na
Vashisth Tiwari
Yonatan Bisk
Sasha Luccioni
Emma Strubell
661
49
0
24 Apr 2025
SD$^2$: Self-Distilled Sparse Drafters
Mike Lasby
Nish Sinnadurai
Valavan Manohararajah
Sean Lie
Yani Andrew Ioannou
Vithursan Thangarasa
851
1
0
10 Apr 2025
SPIRe: Boosting LLM Inference Throughput with Speculative Decoding
Sanjit Neelam
Daniel Heinlein
Vaclav Cvicek
Akshay Mishra
Reiner Pope
LRM
206
0
0
08 Apr 2025
Cognitive Memory in Large Language Models
Lianlei Shan
Shixian Luo
Zezhou Zhu
Yu Yuan
Yong Wu
LLMAG, KELM
1.3K
28
0
03 Apr 2025
ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
E. Georganas
Dhiraj D. Kalamkar
Alexander Kozlov
A. Heinecke
MQ
1.0K
6
0
17 Mar 2025
AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving
Kaiyu Huang
Yu Wang
Zhubo Shi
Han Zou
Minchen Yu
Qingjiang Shi
LRM
391
10
0
07 Mar 2025
Speculative Decoding and Beyond: An In-Depth Survey of Techniques
Y. Hu
Zining Liu
Zhenyuan Dong
Tianfan Peng
Bradley McDanel
Shanghang Zhang
973
0
0
27 Feb 2025
RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
Guanzheng Chen
Qilong Feng
Jinjie Ni
Xin Li
Michael Shieh
RALM
523
8
0
27 Feb 2025
TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation
Tong Wu
Junzhe Shen
Zixia Jia
Yanjie Wang
Zilong Zheng
382
1
0
26 Feb 2025
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
Penghui Yang
Cunxiao Du
Fengzhuo Zhang
Haonan Wang
Tianyu Pang
Chao Du
Bo An
RALM, MQ
385
2
0
24 Feb 2025
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Rishabh Tiwari
Haocheng Xi
Aditya Tomar
Coleman Hooper
Sehoon Kim
Maxwell Horton
Mahyar Najibi
Michael W. Mahoney
Kemal Kurniawan
Amir Gholami
MQ
390
13
0
05 Feb 2025
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Nadav Timor
Jonathan Mamou
Daniel Korat
Moshe Berchansky
Oren Pereg
Gaurav Jain
Roy Schwartz
Moshe Wasserblat
777
10
0
31 Jan 2025
AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding
Zikun Li
Zhuofu Chen
Yingyi Huang
Xupeng Miao
Zeyu Wang
...
Zhuoming Chen
Sean Lai
Xinhao Cheng
Xupeng Miao
Zhihao Jia
413
6
0
21 Jan 2025
Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding
Hyun Ryu
Eric Kim
429
5
0
20 Nov 2024
SSSD: Simply-Scalable Speculative Decoding
Michele Marzollo
Jiawei Zhuang
Niklas Roemer
Lorenz K. Müller
Lukas Cavigelli
LRM
488
2
0
08 Nov 2024
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
Zilin Xiao
Hongming Zhang
Tao Ge
Siru Ouyang
Vicente Ordonez
Dong Yu
351
17
0
08 Oct 2024
No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha
A. Agrawal
Haoran Qiu
Junda Chen
Íñigo Goiri
Chaojie Zhang
Rayyan Shahid
Ramachandran Ramjee
Alexey Tumanov
Esha Choukse
RALM, LRM
694
0
0
25 Sep 2024
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang
Yucheng Li
Chengruidong Zhang
Qianhui Wu
Xufang Luo
...
Amir H. Abdi
Dongsheng Li
Chin-Yew Lin
Yuqing Yang
L. Qiu
442
307
0
02 Jul 2024
TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
Xiaoxuan Liu
Cade Daniel
Langxiang Hu
Woosuk Kwon
Zhuohan Li
...
Kaichao You
Alvin Cheung
Zhijie Deng
Ion Stoica
Hao Zhang
560
23
0
20 Jun 2024
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Jiaming Tang
Yilong Zhao
Kan Zhu
Guangxuan Xiao
Baris Kasikci
Song Han
489
296
0
16 Jun 2024
Loki: Low-Rank Keys for Efficient Sparse Attention
Prajwal Singhania
Siddharth Singh
Shwai He
Soheil Feizi
A. Bhatele
343
66
0
04 Jun 2024
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Kaixuan Huang
Xudong Guo
M. Y. Wang
608
50
0
30 May 2024
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference
International Conference on Learning Representations (ICLR), 2024
Nadav Timor
Jonathan Mamou
Daniel Korat
Moshe Berchansky
Oren Pereg
Moshe Wasserblat
Tomer Galanti
Michal Gordon
David Harel
LRM
328
1
0
23 May 2024
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Ramya Prabhu
Ajay Nayak
Jayashree Mohan
Ramachandran Ramjee
Ashish Panwar
VLM
521
90
0
07 May 2024
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hanshi Sun
Zhuoming Chen
Xinyu Yang
Yuandong Tian
Beidi Chen
431
97
0
18 Apr 2024
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal
Nitin Kedia
Ashish Panwar
Jayashree Mohan
Nipun Kwatra
Bhargav S. Gulavani
Alexey Tumanov
Ramachandran Ramjee
558
467
0
04 Mar 2024
Page 1 of 2