2211.17192
Fast Inference from Transformers via Speculative Decoding
International Conference on Machine Learning (ICML), 2022
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
LRM
Papers citing "Fast Inference from Transformers via Speculative Decoding"
50 / 763 papers shown
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding
Hanling Yi
Feng-Huei Lin
Hongbin Li
Peiyang Ning
Xiaotian Yu
Rong Xiao
LRM
318
21
0
19 Feb 2024
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Zhuoming Chen
Avner May
Ruslan Svirschevski
Yuhsun Huang
Max Ryabinin
Zhihao Jia
Beidi Chen
388
70
0
19 Feb 2024
Speculative Streaming: Fast LLM Inference without Auxiliary Models
Nikhil Bhendawade
Irina Belousova
Qichen Fu
Henry Mason
Mohammad Rastegari
Mahyar Najibi
LRM
281
38
0
16 Feb 2024
Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
Yeonhong Park
Jake Hyun
SangLyul Cho
Bonggeun Sim
Jae W. Lee
MQ
306
39
0
16 Feb 2024
Chain-of-Thought Reasoning Without Prompting
Xuezhi Wang
Denny Zhou
ReLM
LRM
618
205
0
15 Feb 2024
BitDelta: Your Fine-Tune May Only Be Worth One Bit
James Liu
Guangxuan Xiao
Kai Li
Jason D. Lee
Song Han
Tri Dao
Tianle Cai
269
37
0
15 Feb 2024
Accelerating Parallel Sampling of Diffusion Models
Zhiwei Tang
Jiasheng Tang
Hao Luo
Fan Wang
Tsung-Hui Chang
387
25
0
15 Feb 2024
HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference
Yashas Samaga
Varun Yerram
Chong You
Srinadh Bhojanapalli
Sanjiv Kumar
Prateek Jain
Praneeth Netrapalli
181
7
0
14 Feb 2024
Tandem Transformers for Inference Efficient LLMs
S. AishwaryaP
Pranav Ajit Nair
Yashas Samaga
Toby Boyd
Sanjiv Kumar
Prateek Jain
Praneeth Netrapalli
198
10
0
13 Feb 2024
Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
Zack Ankner
Rishab Parthasarathy
Aniruddha Nrusimha
Christopher Rinard
Jonathan Ragan-Kelley
William Brandon
330
65
0
07 Feb 2024
PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition
Jinghui Lu
Ziwei Yang
Yanjie Wang
Xuejing Liu
Brian Mac Namee
Can Huang
MoE
457
12
0
07 Feb 2024
Online Cascade Learning for Efficient Inference over Streams
Lunyiu Nie
Zhimin Ding
Erdong Hu
Christopher M. Jermaine
Swarat Chaudhuri
317
15
0
07 Feb 2024
Linear-time Minimum Bayes Risk Decoding with Reference Aggregation
Jannis Vamvas
Rico Sennrich
311
24
0
06 Feb 2024
ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs
Zhengyan Zhang
Yixin Song
Guanghui Yu
Xu Han
Yankai Lin
Chaojun Xiao
Chenyang Song
Zhiyuan Liu
Zeyu Mi
Maosong Sun
248
46
0
06 Feb 2024
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao
Peiyi Wang
Qihao Zhu
Runxin Xu
Jun-Mei Song
...
Haowei Zhang
Mingchuan Zhang
Yiming Li
Yu-Huan Wu
Daya Guo
ReLM
LRM
1.5K
3,856
0
05 Feb 2024
Decoding-time Realignment of Language Models
International Conference on Machine Learning (ICML), 2024
Tianlin Liu
Shangmin Guo
Leonardo Bianco
Daniele Calandriello
Quentin Berthet
Felipe Llinares-López
Jessica Hoffmann
Lucas Dixon
Michal Valko
Mathieu Blondel
AI4CE
272
57
0
05 Feb 2024
A Survey on Transformer Compression
Yehui Tang
Yunhe Wang
Jianyuan Guo
Zhijun Tu
Kai Han
Hailin Hu
Dacheng Tao
474
66
0
05 Feb 2024
DeAL: Decoding-time Alignment for Large Language Models
James Y. Huang
Sailik Sengupta
Daniele Bonadiman
Yi-An Lai
Arshit Gupta
Nikolaos Pappas
Saab Mansour
Katrin Kirchhoff
Dan Roth
423
44
0
05 Feb 2024
GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding
Cunxiao Du
Jing Jiang
Yuanchen Xu
Jiawei Wu
Sicheng Yu
...
Shenggui Li
Kai Xu
Liqiang Nie
Zhaopeng Tu
Yang You
240
61
0
03 Feb 2024
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Yichao Fu
Peter Bailis
Ion Stoica
Hao Zhang
373
241
0
03 Feb 2024
Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward
Arnav Chavan
Raghav Magazine
Shubham Kushwaha
M. Debbah
Deepak Gupta
291
37
0
02 Feb 2024
Decoding Speculative Decoding
Minghao Yan
Saurabh Agarwal
Shivaram Venkataraman
LRM
334
24
0
02 Feb 2024
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
International Conference on Machine Learning (ICML), 2024
Yuhui Li
Fangyun Wei
Chao Zhang
Hongyang R. Zhang
590
314
0
26 Jan 2024
Accelerating Retrieval-Augmented Language Model Serving with Speculation
Zhihao Zhang
Alan Zhu
Lijie Yang
Yihua Xu
Lanting Li
P. Phothilimthana
Zhihao Jia
RALM
KELM
260
21
0
25 Jan 2024
MambaByte: Token-free Selective State Space Model
Junxiong Wang
Tushaar Gangavarapu
Jing Nathan Yan
Alexander M. Rush
Mamba
311
54
0
24 Jan 2024
Eloquent: A More Robust Transmission Scheme for LLM Token Streaming
Hanchen Li
Yuhan Liu
Yihua Cheng
Siddhant Ray
Kuntai Du
Junchen Jiang
199
6
0
23 Jan 2024
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
Expert Systems with Applications (ESWA), 2024
Feng-Huei Lin
Hanling Yi
Hongbin Li
Yifan Yang
Xiaotian Yu
Guangming Lu
Rong Xiao
231
10
0
23 Jan 2024
Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text
International Conference on Machine Learning (ICML), 2024
Abhimanyu Hans
Avi Schwarzschild
Valeriia Cherepanova
Hamid Kazemi
Aniruddha Saha
Micah Goldblum
Jonas Geiping
Tom Goldstein
DeLMO
294
210
0
22 Jan 2024
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai
Yuhong Li
Zhengyang Geng
Hongwu Peng
Jason D. Lee
De-huai Chen
Tri Dao
579
510
0
19 Jan 2024
Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native
Yao Lu
Song Bian
Lequn Chen
Yongjun He
Yulong Hui
...
Huanchen Zhang
Minjia Zhang
Qizhen Zhang
Tianyi Zhou
Danyang Zhuo
212
13
0
17 Jan 2024
Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models
Shuming Shi
Enbo Zhao
Deng Cai
Leyang Cui
Xinting Huang
Huayang Li
122
4
0
16 Jan 2024
Learned Best-Effort LLM Serving
Siddharth Jha
Coleman Hooper
Xiaoxuan Liu
Sehoon Kim
Kurt Keutzer
106
4
0
15 Jan 2024
JumpCoder: Go Beyond Autoregressive Coder via Online Modification
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Mouxiang Chen
Hao Tian
Zhongxi Liu
Xiaoxue Ren
Jianling Sun
SyDa
KELM
276
8
0
15 Jan 2024
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Heming Xia
Zhe Yang
Qingxiu Dong
Peiyi Wang
Chak Tou Leong
Tao Ge
Tianyu Liu
Wenjie Li
Zhifang Sui
LRM
462
204
0
15 Jan 2024
APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding
Mingdao Liu
Aohan Zeng
Bowen Wang
Peng Zhang
Jie Tang
Yuxiao Dong
183
19
0
12 Jan 2024
Multi-Candidate Speculative Decoding
Sen Yang
Shujian Huang
Xinyu Dai
Jiajun Chen
BDL
235
28
0
12 Jan 2024
Distilling Vision-Language Models on Millions of Videos
Computer Vision and Pattern Recognition (CVPR), 2024
Yue Zhao
Long Zhao
Xingyi Zhou
Jialin Wu
Chun-Te Chu
...
Hartwig Adam
Ting Liu
Boqing Gong
Philipp Krahenbuhl
Liangzhe Yuan
VLM
279
20
0
11 Jan 2024
Pheme: Efficient and Conversational Speech Generation
Paweł Budzianowski
Taras Sereda
Tomasz Cichy
Ivan Vulić
193
10
0
05 Jan 2024
Training and Serving System of Foundation Models: A Comprehensive Survey
Jiahang Zhou
Yanyu Chen
Zicong Hong
Wuhui Chen
Yue Yu
Tao Zhang
Hui Wang
Chuan-fu Zhang
Zibin Zheng
ALM
223
14
0
05 Jan 2024
IoT in the Era of Generative AI: Vision and Challenges
IEEE Internet Computing (IEEE Internet Comput.), 2024
Xin Wang
Zhongwei Wan
Arvin Hekmati
M. Zong
Samiul Alam
Mi Zhang
Bhaskar Krishnamachari
263
5
0
03 Jan 2024
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
Xupeng Miao
Zhihao Zhang
Xinhao Cheng
Hongyi Jin
Tianqi Chen
Zhihao Jia
416
119
0
23 Dec 2023
Structure-Aware Path Inference for Neural Finite State Transducers
Weiting Tan
Chu-cheng Lin
Jason Eisner
152
0
0
21 Dec 2023
Cascade Speculative Drafting for Even Faster LLM Inference
Ziyi Chen
Xiaocong Yang
Jiacheng Lin
Chenkai Sun
Kevin Chen-Chuan Chang
Jie Huang
LRM
562
73
0
18 Dec 2023
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Peiyi Wang
Lei Li
Zhihong Shao
R. X. Xu
Damai Dai
Yifei Li
Deli Chen
Y. Wu
Zhifang Sui
AIMat
LRM
ALM
443
667
0
14 Dec 2023
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Keivan Alizadeh-Vahid
Iman Mirzadeh
Dmitry Belenko
Karen Khatamifard
Minsik Cho
C. C. D. Mundo
Mohammad Rastegari
Mehrdad Farajtabar
271
194
0
12 Dec 2023
A Review of Hybrid and Ensemble in Deep Learning for Natural Language Processing
Jianguo Jia
Wen-Chieh Liang
Youzhi Liang
VLM
157
29
0
09 Dec 2023
Stateful Large Language Model Serving with Pensieve
European Conference on Computer Systems (EuroSys), 2023
Lingfan Yu
Jinyang Li
RALM
KELM
LLMAG
273
41
0
09 Dec 2023
Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
Symposium on Operating Systems Principles (SOSP), 2023
Yinwei Dai
Rui Pan
Anand Iyer
Kai Li
Ravi Netravali
163
17
0
08 Dec 2023
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
International Conference on Machine Learning (ICML), 2023
Yanxi Chen
Xuchen Pan
Yaliang Li
Bolin Ding
Jingren Zhou
LRM
486
57
0
08 Dec 2023
An LLM Compiler for Parallel Function Calling
Sehoon Kim
Suhong Moon
Ryan Tabrizi
Nicholas Lee
Michael W. Mahoney
Kurt Keutzer
A. Gholami
LRM
369
114
0
07 Dec 2023