2401.07851
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
15 January 2024
Heming Xia
Zhe Yang
Qingxiu Dong
Peiyi Wang
Yongqi Li
Tao Ge
Tianyu Liu
Wenjie Li
Zhifang Sui
LRM
Papers citing "Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding"
34 / 84 papers shown
Title
KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning
Kaiqi Zhang
Jing Zhao
Rui Chen
29
1
0
15 Aug 2024
CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding
Sophia Ho
Jinsol Park
Patrick Wang
19
0
0
08 Aug 2024
Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding
Bin Xiao
Lujun Gui
Lei Su
Weipeng Chen
18
2
0
01 Aug 2024
Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
Georgy Tyukin
G. Dovonon
Jean Kaddour
Pasquale Minervini
LRM
23
0
0
22 Jul 2024
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
Zilong Wang
Zifeng Wang
Long Le
Huaixiu Steven Zheng
Swaroop Mishra
...
Anush Mattapalli
Ankur Taly
Jingbo Shang
Chen-Yu Lee
Tomas Pfister
RALM
70
30
0
11 Jul 2024
Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models
Jinliang Lu
Ziliang Pang
Min Xiao
Yaochen Zhu
Rui Xia
Jiajun Zhang
MoMe
22
17
0
08 Jul 2024
S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models
Parsa Kavehzadeh
Mohammadreza Pourreza
Mojtaba Valipour
Tianshu Zhu
Haoli Bai
Ali Ghodsi
Boxing Chen
Mehdi Rezagholizadeh
27
0
0
02 Jul 2024
SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding
Zhenglin Wang
Jialong Wu
Yilong Lai
Congzhi Zhang
Deyu Zhou
LRM
ReLM
28
3
0
26 Jun 2024
Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
Yixuan Wang
Xianzhen Luo
Fuxuan Wei
Yijun Liu
Qingfu Zhu
Xuanyu Zhang
Qing Yang
Dongliang Xu
Wanxiang Che
35
3
0
25 Jun 2024
OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
Jikai Wang
Yi Su
Juntao Li
Qingrong Xia
Zi Ye
Xinyu Duan
Zhefeng Wang
Min Zhang
29
11
0
25 Jun 2024
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Yuhui Li
Fangyun Wei
Chao Zhang
Hongyang R. Zhang
75
1
0
24 Jun 2024
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Euiin Yi
Taehyeon Kim
Hongseok Jeung
Du-Seong Chang
Se-Young Yun
38
4
0
24 Jun 2024
Cascade Reward Sampling for Efficient Decoding-Time Alignment
Bolian Li
Yifan Wang
A. Grama
Ruqi Zhang
AI4TS
44
8
0
24 Jun 2024
Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding
Kaiyan Zhang
Jianyu Wang
Ning Ding
Biqing Qi
Ermo Hua
Xingtai Lv
Bowen Zhou
28
7
0
18 Jun 2024
Optimized Speculative Sampling for GPU Hardware Accelerators
Dominik Wagner
Seanie Lee
Ilja Baumann
Philipp Seeberger
K. Riedhammer
Tobias Bocklet
28
3
0
16 Jun 2024
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
Ruslan Svirschevski
Avner May
Zhuoming Chen
Beidi Chen
Zhihao Jia
Max Ryabinin
18
12
0
04 Jun 2024
S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs
Wei Zhong
Manasa Bharadwaj
31
5
0
30 May 2024
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Kaixuan Huang
Xudong Guo
Mengdi Wang
27
16
0
30 May 2024
Faster Cascades via Speculative Decoding
Harikrishna Narasimhan
Wittawat Jitkrittum
A. S. Rawat
Seungyeon Kim
Neha Gupta
A. Menon
Sanjiv Kumar
LRM
36
6
0
29 May 2024
SED: Self-Evaluation Decoding Enhances Large Language Models for Better Generation
Ziqin Luo
Haixia Han
Haokun Zhao
Guochao Jiang
Chengyu Du
Tingyun Li
Jiaqing Liang
Deqing Yang
Yanghua Xiao
38
3
0
26 May 2024
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
Fangcheng Liu
Yehui Tang
Zhenhua Liu
Yunsheng Ni
Kai Han
Yunhe Wang
30
23
0
29 Apr 2024
A Survey on Efficient Inference for Large Language Models
Zixuan Zhou
Xuefei Ning
Ke Hong
Tianyu Fu
Jiaming Xu
...
Shengen Yan
Guohao Dai
Xiao-Ping Zhang
Yuhan Dong
Yu-Xiang Wang
46
78
0
22 Apr 2024
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hanshi Sun
Zhuoming Chen
Xinyu Yang
Yuandong Tian
Beidi Chen
33
46
0
18 Apr 2024
CoGenesis: A Framework Collaborating Large and Small Language Models for Secure Context-Aware Instruction Following
Kaiyan Zhang
Jianyu Wang
Ermo Hua
Biqing Qi
Ning Ding
Bowen Zhou
SyDa
17
20
0
05 Mar 2024
Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens
Ziqian Zeng
Jiahong Yu
Qianshi Pang
Zihao W. Wang
Huiping Zhuang
Cen Chen
Xiaofeng Zou
18
4
0
24 Feb 2024
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Weilin Zhao
Yuxiang Huang
Xu Han
Wang Xu
Chaojun Xiao
Xinrong Zhang
Yewei Fang
Kaihuo Zhang
Zhiyuan Liu
Maosong Sun
35
10
0
21 Feb 2024
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Marco Gaido
Sara Papi
Matteo Negri
L. Bentivogli
33
11
0
19 Feb 2024
Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
Zack Ankner
Rishab Parthasarathy
Aniruddha Nrusimha
Christopher Rinard
Jonathan Ragan-Kelley
William Brandon
6
24
0
07 Feb 2024
Nevermind: Instruction Override and Moderation in Large Language Models
Edward Kim
ALM
16
0
0
05 Feb 2024
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao
Peiyi Wang
Qihao Zhu
Runxin Xu
Jun-Mei Song
...
Haowei Zhang
Mingchuan Zhang
Y. K. Li
Yu-Huan Wu
Daya Guo
ReLM
LRM
26
620
0
05 Feb 2024
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Yichao Fu
Peter Bailis
Ion Stoica
Hao Zhang
120
134
0
03 Feb 2024
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Pratyush Maini
Skyler Seto
Richard He Bai
David Grangier
Yizhe Zhang
Navdeep Jaitly
SyDa
20
54
0
29 Jan 2024
ngram-OAXE: Phrase-Based Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation
Cunxiao Du
Zhaopeng Tu
Longyue Wang
Jing Jiang
27
10
0
08 Oct 2022
Lossless Acceleration for Seq2seq Generation with Aggressive Decoding
Tao Ge
Heming Xia
Xin Sun
Si-Qing Chen
Furu Wei
82
18
0
20 May 2022