ResearchTrend.AI
Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2023
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
    LRM
ArXiv (abs) · PDF · HTML · HuggingFace (9 upvotes)
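For readers browsing this citation list, the cited paper's core technique, speculative decoding, can be sketched in a few lines: a cheap draft model proposes several tokens, and the target model verifies them with accept probability min(1, p/q), resampling from the renormalized residual max(0, p − q) on rejection. The sketch below is a minimal toy illustration, not the authors' implementation: the fixed categorical distributions `draft_dist` and `target_dist` and the helper names are assumptions for demonstration only.

```python
import random

random.seed(0)

VOCAB = 4  # toy vocabulary size

def draft_dist(prefix):
    # Hypothetical small "draft" model: a fixed categorical distribution.
    return [0.4, 0.3, 0.2, 0.1]

def target_dist(prefix):
    # Hypothetical large "target" model: a different fixed distribution.
    return [0.25, 0.25, 0.25, 0.25]

def sample(dist):
    # Draw one token index from a categorical distribution.
    r, acc = random.random(), 0.0
    for tok, p in enumerate(dist):
        acc += p
        if r < acc:
            return tok
    return len(dist) - 1

def speculative_step(prefix, k=4):
    """One speculative-decoding step: draft k tokens, then verify them
    against the target model. Returns the accepted tokens plus one token
    sampled from the target (the corrected token at the first rejection,
    or a bonus token when all k drafts are accepted)."""
    drafts, ctx = [], list(prefix)
    for _ in range(k):
        t = sample(draft_dist(ctx))
        drafts.append(t)
        ctx.append(t)

    accepted, ctx = [], list(prefix)
    for t in drafts:
        p = target_dist(ctx)[t]   # target probability of the drafted token
        q = draft_dist(ctx)[t]    # draft probability of the same token
        if random.random() < min(1.0, p / q):
            accepted.append(t)    # accept: keep the draft token
            ctx.append(t)
        else:
            # Reject: resample from the residual max(0, p - q), renormalized.
            pt, qt = target_dist(ctx), draft_dist(ctx)
            residual = [max(0.0, a - b) for a, b in zip(pt, qt)]
            z = sum(residual)
            return accepted + [sample([r / z for r in residual])]
    # All k drafts accepted: sample one bonus token from the target.
    return accepted + [sample(target_dist(ctx))]

print(speculative_step([0], k=4))
```

This acceptance/residual scheme is what makes the method lossless: the emitted tokens are distributed exactly as if they had been sampled from the target model alone.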

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown
Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
Tyler Griggs
Xiaoxuan Liu
Jiaxiang Yu
Doyoung Kim
Wei-Lin Chiang
Alvin Cheung
Ion Stoica
336
27
0
22 Apr 2024
SnapKV: LLM Knows What You are Looking for Before Generation
Yuhong Li
Yingbing Huang
Bowen Yang
Bharat Venkitesh
Acyr Locatelli
Hanchen Ye
Tianle Cai
Patrick Lewis
Deming Chen
VLM
413
383
0
22 Apr 2024
A Survey on Efficient Inference for Large Language Models
Zixuan Zhou
Xuefei Ning
Ke Hong
Tianyu Fu
Jiaming Xu
...
Shengen Yan
Guohao Dai
Xiao-Ping Zhang
Yuhan Dong
Yu Wang
420
174
0
22 Apr 2024
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration
Pengfei Wu
Jiahao Liu
Zhuocheng Gong
Qifan Wang
Jinpeng Li
Jingang Wang
Xunliang Cai
Dongyan Zhao
188
3
0
18 Apr 2024
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hanshi Sun
Zhuoming Chen
Xinyu Yang
Yuandong Tian
Beidi Chen
369
84
0
18 Apr 2024
Language Model Cascades: Token-level uncertainty and beyond
Neha Gupta
Harikrishna Narasimhan
Wittawat Jitkrittum
A. S. Rawat
A. Menon
Sanjiv Kumar
UQLM
464
90
0
15 Apr 2024
Improving Recall of Large Language Models: A Model Collaboration Approach for Relational Triple Extraction
Zepeng Ding
Wenhao Huang
Jiaqing Liang
Deqing Yang
Yanghua Xiao
KELM
210
13
0
15 Apr 2024
Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models
Siyan Zhao
Daniel Israel
Karen Ullrich
Aditya Grover
KELM VLM
186
9
0
15 Apr 2024
Exploring and Improving Drafts in Blockwise Parallel Decoding
Taehyeon Kim
A. Suresh
Kishore Papineni
Michael Riley
Sanjiv Kumar
Adrian Benton
AI4TS
281
4
0
14 Apr 2024
On Speculative Decoding for Multimodal Large Language Models
Mukul Gagrani
Raghavv Goel
Wonseok Jeon
Junyoung Park
Mingu Lee
Christopher Lott
LRM
172
21
0
13 Apr 2024
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Haoran Qiu
Weichao Mao
Archit Patke
Shengkun Cui
Saurabh Jha
Chen Wang
Hubertus Franke
Zbigniew T. Kalbarczyk
Tamer Basar
Ravishankar K. Iyer
227
49
0
12 Apr 2024
Reducing hallucination in structured outputs via Retrieval-Augmented Generation
Patrice Béchard
Orlando Marquez Ayala
LLMAG
245
119
0
12 Apr 2024
Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Jie Ou
Yueming Chen
Wenhong Tian
302
23
0
10 Apr 2024
CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Longwei Zou
Qingyang Wang
Han Zhao
Tingfeng Liu
Yi Yang
Yangdong Deng
248
1
0
10 Apr 2024
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
Bowen Pan
Songlin Yang
Haokun Liu
Mayank Mishra
Gaoyuan Zhang
Aude Oliva
Colin Raffel
Yikang Shen
MoE
262
32
0
08 Apr 2024
Training LLMs over Neurally Compressed Text
Brian Lester
Jaehoon Lee
A. Alemi
Jeffrey Pennington
Adam Roberts
Jascha Narain Sohl-Dickstein
Noah Constant
207
11
0
04 Apr 2024
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
Michael Hassid
Tal Remez
Jonas Gehring
Roy Schwartz
Yossi Adi
272
40
0
31 Mar 2024
SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens
Chengbo Liu
Yong Zhu
124
2
0
27 Mar 2024
The Unreasonable Ineffectiveness of the Deeper Layers
Andrey Gromov
Kushal Tirumala
Hassan Shapourian
Paolo Glorioso
Daniel A. Roberts
428
158
0
26 Mar 2024
Multi-Level Explanations for Generative Language Models
Lucas Monteiro Paes
Dennis L. Wei
Hyo Jin Do
Hendrik Strobelt
Ronny Luss
...
Manish Nagireddy
Karthikeyan N. Ramamurthy
P. Sattigeri
Werner Geyer
Soumya Ghosh
FAtt
316
14
0
21 Mar 2024
Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks
Bo-Ru Lu
Nikita Haduong
Chien-Yu Lin
Hao Cheng
Noah A. Smith
Mari Ostendorf
AI4CE
209
1
0
19 Mar 2024
Toward Sustainable GenAI using Generation Directives for Carbon-Friendly Large Language Model Inference
Baolin Li
Yankai Jiang
V. Gadepally
Devesh Tiwari
244
23
0
19 Mar 2024
MELTing point: Mobile Evaluation of Language Transformers
Stefanos Laskaridis
Kleomenis Katevas
Lorenzo Minto
Hamed Haddadi
301
34
0
19 Mar 2024
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
Aonan Zhang
Chong-Jun Wang
Yi Wang
Xuanyu Zhang
Yunfei Cheng
291
26
0
14 Mar 2024
Token Alignment via Character Matching for Subword Completion
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Ben Athiwaratkun
Shiqi Wang
Mingyue Shang
Yuchen Tian
Zijian Wang
Sujan Kumar Gonugondla
Sanjay Krishna Gouda
Rob Kwiatowski
Ramesh Nallapati
Bing Xiang
192
9
0
13 Mar 2024
Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
Ben Athiwaratkun
Sujan Kumar Gonugondla
Sanjay Krishna Gouda
Haifeng Qian
Hantian Ding
...
Liangfu Chen
Parminder Bhatia
Ramesh Nallapati
Sudipta Sengupta
Bing Xiang
267
5
0
13 Mar 2024
CHAI: Clustered Head Attention for Efficient LLM Inference
International Conference on Machine Learning (ICML), 2024
Saurabh Agarwal
Bilge Acun
Basil Homer
Mostafa Elhoushi
Yejin Lee
Shivaram Venkataraman
Dimitris Papailiopoulos
Carole-Jean Wu
257
12
0
12 Mar 2024
Rethinking Generative Large Language Model Evaluation for Semantic Comprehension
International Conference on Machine Learning (ICML), 2024
Fangyun Wei
Xi Chen
Linzi Luo
ELM ALM LRM
201
14
0
12 Mar 2024
Learning to Decode Collaboratively with Multiple Language Models
Zejiang Shen
Hunter Lang
Bailin Wang
Yoon Kim
David Sontag
160
52
0
06 Mar 2024
Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People
Xidong Wang
Nuo Chen
Junying Chen
Yan Hu
Yidong Wang
Xiangbo Wu
Anningzhe Gao
Xiang Wan
Haizhou Li
Benyou Wang
LM&MA
298
42
0
06 Mar 2024
CoGenesis: A Framework Collaborating Large and Small Language Models for Secure Context-Aware Instruction Following
Kaiyan Zhang
Jianyu Wang
Ermo Hua
Biqing Qi
Ning Ding
Bowen Zhou
SyDa
357
35
0
05 Mar 2024
SynCode: LLM Generation with Grammar Augmentation
Shubham Ugare
Tarun Suresh
Hangoo Kang
Sasa Misailovic
Gagandeep Singh
294
37
0
03 Mar 2024
Accelerating Greedy Coordinate Gradient via Probe Sampling
Yiran Zhao
Wenyue Zheng
Tianle Cai
Xuan Long Do
Kenji Kawaguchi
Anirudh Goyal
Michael Shieh
322
2
0
02 Mar 2024
IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
Ruikang Liu
Haoli Bai
Haokun Lin
Yuening Li
Han Gao
Zheng-Jun Xu
Lu Hou
Jun Yao
Chun Yuan
MQ
254
44
0
02 Mar 2024
Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
Raghavv Goel
Mukul Gagrani
Wonseok Jeon
Junyoung Park
Mingu Lee
Christopher Lott
ALM
276
11
0
29 Feb 2024
Retrieval-Augmented Generation for AI-Generated Content: A Survey
Penghao Zhao
Hailin Zhang
Qinhan Yu
Zhengren Wang
Yunteng Geng
Fangcheng Fu
Ling Yang
Wentao Zhang
Jie Jiang
Tengjiao Wang
3DV
964
463
0
29 Feb 2024
CLLMs: Consistency Large Language Models
Siqi Kou
Lanxiang Hu
Zhe He
Zhijie Deng
Hao Zhang
464
54
0
28 Feb 2024
On the Challenges and Opportunities in Generative AI
Laura Manduchi
Kushagra Pandey
Robert Bamler
Sina Daubener
...
Yixin Wang
F. Wenzel
Frank Wood
Stephan Mandt
Vincent Fortuin
761
40
0
28 Feb 2024
Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding
Benjamin Bergner
Andrii Skliar
Amelie Royer
Tijmen Blankevoort
Yuki Markus Asano
B. Bejnordi
298
14
0
26 Feb 2024
Investigating the Effectiveness of HyperTuning via Gisting
Jason Phang
296
2
0
26 Feb 2024
LLM Inference Unveiled: Survey and Roofline Model Insights
Zhihang Yuan
Yuzhang Shang
Yang Zhou
Zhen Dong
Zhe Zhou
...
Yong Jae Lee
Yan Yan
Beidi Chen
Guangyu Sun
Kurt Keutzer
629
149
0
26 Feb 2024
Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens
Jiahong Yu
Qianshi Pang
Zihao Wang
Huiping Zhuang
Cen Chen
Xiaofeng Zou
252
6
0
24 Feb 2024
RelayAttention for Efficient Large Language Model Serving with Long System Prompts
Lei Zhu
Xinjiang Wang
Wayne Zhang
Rynson W. H. Lau
230
14
0
22 Feb 2024
T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching
Zizheng Pan
Bohan Zhuang
De-An Huang
Weili Nie
Zhiding Yu
Chaowei Xiao
Jianfei Cai
A. Anandkumar
232
25
0
21 Feb 2024
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Weilin Zhao
Yuxiang Huang
Xu Han
Wang Xu
Chaojun Xiao
Xinrong Zhang
Yewei Fang
Kaihuo Zhang
Zhiyuan Liu
Maosong Sun
276
23
0
21 Feb 2024
ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models
Chenyang Song
Xu Han
Zhengyan Zhang
Shengding Hu
Xiyu Shi
...
Chen Chen
Zhiyuan Liu
Guanglin Li
Tao Yang
Maosong Sun
376
41
0
21 Feb 2024
ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
Shuzhang Zhong
Zebin Yang
Meng Li
Ruihao Gong
Runsheng Wang
Ru Huang
217
13
0
21 Feb 2024
Purifying Large Language Models by Ensembling a Small Language Model
Tianlin Li
Qian Liu
Tianyu Pang
Chao Du
Qing Guo
Yang Liu
Min Lin
218
26
0
19 Feb 2024
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Marco Gaido
Sara Papi
Matteo Negri
L. Bentivogli
469
26
0
19 Feb 2024
Revisiting Knowledge Distillation for Autoregressive Language Models
Qihuang Zhong
Liang Ding
Li Shen
Juhua Liu
Bo Du
Dacheng Tao
KELM
309
27
0
19 Feb 2024