Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2022

30 November 2022

Yaniv Leviathan

Matan Kalman

Yossi Matias

LRM

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

318

19 Feb 2024

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

388

19 Feb 2024

Speculative Streaming: Fast LLM Inference without Auxiliary Models

281

16 Feb 2024

Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

306

16 Feb 2024

Chain-of-Thought Reasoning Without Prompting

Xuezhi Wang

Denny Zhou

ReLM LRM

618

205

15 Feb 2024

BitDelta: Your Fine-Tune May Only Be Worth One Bit

Song Han

Tianle Cai

269

15 Feb 2024

Accelerating Parallel Sampling of Diffusion Models

Fan Wang

387

15 Feb 2024

HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference

Yashas Samaga

Varun Yerram

Chong You

Srinadh Bhojanapalli

Sanjiv Kumar

Prateek Jain

Praneeth Netrapalli

181

14 Feb 2024

Tandem Transformers for Inference Efficient LLMs

Sanjiv Kumar

198

13 Feb 2024

Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

Jonathan Ragan-Kelley

William Brandon

330

07 Feb 2024

PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition

457

07 Feb 2024

Online Cascade Learning for Efficient Inference over Streams

Lunyiu Nie

Zhimin Ding

Erdong Hu

Christopher M. Jermaine

Swarat Chaudhuri

317

07 Feb 2024

Linear-time Minimum Bayes Risk Decoding with Reference Aggregation

Jannis Vamvas

Rico Sennrich

311

06 Feb 2024

ReLU^2 Wins: Discovering Efficient Activation Functions for Sparse LLMs

Zhengyan Zhang

Yixin Song

Guanghui Yu

Xu Han

Yankai Lin

Chaojun Xiao

Chenyang Song

Zhiyuan Liu

Zeyu Mi

Maosong Sun

248

06 Feb 2024

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao

Peiyi Wang

Runxin Xu

...

1.5K

3,856

05 Feb 2024

Decoding-time Realignment of Language Models

International Conference on Machine Learning (ICML), 2024

Daniele Calandriello

Felipe Llinares-López

272

05 Feb 2024

A Survey on Transformer Compression

474

05 Feb 2024

DeAL: Decoding-time Alignment for Large Language Models

423

05 Feb 2024

GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding

...

Yang You

240

03 Feb 2024

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Yichao Fu

Peter Bailis

Ion Stoica

Hao Zhang

373

241

03 Feb 2024

Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward

291

02 Feb 2024

Decoding Speculative Decoding

Minghao Yan

Saurabh Agarwal

Shivaram Venkataraman

LRM

334

02 Feb 2024

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

International Conference on Machine Learning (ICML), 2024

590

314

26 Jan 2024

Accelerating Retrieval-Augmented Language Model Serving with Speculation

260

25 Jan 2024

MambaByte: Token-free Selective State Space Model

311

24 Jan 2024

Eloquent: A More Robust Transmission Scheme for LLM Token Streaming

199

23 Jan 2024

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

Expert Systems with Applications (ESWA), 2024

231

23 Jan 2024

Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text

International Conference on Machine Learning (ICML), 2024

294

210

22 Jan 2024

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai

579

510

19 Jan 2024

Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native

...

212

17 Jan 2024

Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models

Leyang Cui

122

16 Jan 2024

Learned Best-Effort LLM Serving

Siddharth Jha

Coleman Hooper

Xiaoxuan Liu

Sehoon Kim

Kurt Keutzer

106

15 Jan 2024

JumpCoder: Go Beyond Autoregressive Coder via Online Modification

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Jianling Sun

276

15 Jan 2024

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Heming Xia

Zhe Yang

Qingxiu Dong

Peiyi Wang

Zhifang Sui

462

204

15 Jan 2024

APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding

Yuxiao Dong

183

12 Jan 2024

Multi-Candidate Speculative Decoding

235

12 Jan 2024

Distilling Vision-Language Models on Millions of Videos

Computer Vision and Pattern Recognition (CVPR), 2024

...

279

11 Jan 2024

Pheme: Efficient and Conversational Speech Generation

193

05 Jan 2024

Training and Serving System of Foundation Models: A Comprehensive Survey

223

05 Jan 2024

IoT in the Era of Generative AI: Vision and Challenges

IEEE Internet Computing (IEEE Internet Comput.), 2024

Zhongwei Wan

Bhaskar Krishnamachari

263

03 Jan 2024

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

416

119

23 Dec 2023

Structure-Aware Path Inference for Neural Finite State Transducers

Weiting Tan

Chu-cheng Lin

Jason Eisner

152

21 Dec 2023

Cascade Speculative Drafting for Even Faster LLM Inference

Kevin Chen-Chuan Chang

Jie Huang

LRM

562

18 Dec 2023

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Peiyi Wang

Lei Li

Zhihong Shao

R. X. Xu

Zhifang Sui

443

667

14 Dec 2023

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Keivan Alizadeh-Vahid

Minsik Cho

271

194

12 Dec 2023

A Review of Hybrid and Ensemble in Deep Learning for Natural Language Processing

157

09 Dec 2023

Stateful Large Language Model Serving with Pensieve

European Conference on Computer Systems (EuroSys), 2023

Lingfan Yu

Jinyang Li

RALM KELM LLMAG

273

09 Dec 2023

Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving

Symposium on Operating Systems Principles (SOSP), 2023

163

08 Dec 2023

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

International Conference on Machine Learning (ICML), 2023

Jingren Zhou

486

08 Dec 2023

An LLM Compiler for Parallel Function Calling

Sehoon Kim

Suhong Moon

Ryan Tabrizi

Nicholas Lee

369

114

07 Dec 2023