Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2022

30 November 2022

Yaniv Leviathan

Matan Kalman

Yossi Matias

LRM

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown

Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference

Neural Information Processing Systems (NeurIPS), 2024

R. Prabhakar

Hengrui Zhang

D. Wentzlaff

294

14 Aug 2024

PEARL: Parallel Speculative Decoding with Adaptive Draft Length

International Conference on Learning Representations (ICLR), 2024

383

13 Aug 2024

Post-Training Sparse Attention with Double Sparsity

Shuo Yang

Ying Sheng

Joseph E. Gonzalez

Ion Stoica

Lianmin Zheng

296

11 Aug 2024

Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024

401

11 Aug 2024

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

526

10 Aug 2024

Retrieval-augmented code completion for local projects using large language models

Expert Systems with Applications (ESWA), 2024

Marko Hostnik

Marko Robnik-Šikonja

RALM

275

09 Aug 2024

CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding

Sophia Ho

Jinsol Park

Patrick Wang

211

08 Aug 2024

StructuredRAG: JSON Response Formatting with Large Language Models

Connor Shorten

Charles Pierse

Thomas Benjamin Smith

292

07 Aug 2024

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

Leo Donisch

Sigurd Schacht

Carsten Lanquillon

298

06 Aug 2024

Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding

195

01 Aug 2024

ThinK: Thinner Key Cache by Query-Driven Pruning

International Conference on Learning Representations (ICLR), 2024

533

30 Jul 2024

Inference acceleration for large language models using "stairs" assisted greedy generation

International Conference on Information Technology (ICIT), 2024

Domas Grigaliunas

M. Lukoševičius

118

29 Jul 2024

Graph-Structured Speculative Decoding

Dongyan Zhao

Rui Yan

195

23 Jul 2024

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

Minsik Cho

331

19 Jul 2024

Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

243

15 Jul 2024

Accelerating the inference of string generation-based chemical reaction models for industrial applications

Jürgen Schmidhuber

217

12 Jul 2024

Inference Optimization of Foundation Models on AI Accelerators

Matthäus Kleindessner

313

12 Jul 2024

Automata-based constraints for language model decoding

373

11 Jul 2024

Robotic Control via Embodied Chain-of-Thought Reasoning

Michał Zawalski

Sergey Levine

470

214

11 Jul 2024

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Zilong Wang

Zifeng Wang

Long Le

Huaixiu Steven Zheng

...

329

11 Jul 2024

Knowledge boosting during low-latency inference

286

09 Jul 2024

Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems

258

09 Jul 2024

Mobile Edge Intelligence for Large Language Models: A Contemporary Survey

Guanqiao Qu

Qiyuan Chen

Wei Wei

Zheng Lin

Xianhao Chen

Kaibin Huang

544

157

09 Jul 2024

Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models

Jiajun Zhang

395

08 Jul 2024

Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Bin Wang

Weiping Wang

246

08 Jul 2024

Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models

173

05 Jul 2024

Uncertainty-Guided Likelihood Tree Search

383

04 Jul 2024

Let the Code LLM Edit Itself When You Edit the Code

Jingjing Xu

276

03 Jul 2024

S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

Parsa Kavehzadeh

Mohammadreza Pourreza

Mojtaba Valipour

Tianshu Zhu

Haoli Bai

Ali Ghodsi

Boxing Chen

Mehdi Rezagholizadeh

209

02 Jul 2024

Tree Search for Language Model Agents

404

118

01 Jul 2024

Adaptive Draft-Verification for Efficient Large Language Model Decoding

271

27 Jun 2024

SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding

235

26 Jun 2024

Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher

Hyunjong Ok

Jegwang Ryu

Jaeho Lee

131

26 Jun 2024

Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training

Yixuan Wang

Yijun Liu

Qing Yang

185

25 Jun 2024

OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure

Jikai Wang

Yi Su

Juntao Li

Qingrong Xia

Zi Ye

Xinyu Duan

Zhefeng Wang

Min Zhang

435

25 Jun 2024

Speeding Up Image Classifiers with Little Companions

Yang Liu

Kowshik Thopalli

Jayaraman J. Thiagarajan

VLM

269

24 Jun 2024

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

Yuhui Li

Fangyun Wei

Chao Zhang

Hongyang R. Zhang

406

188

24 Jun 2024

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

Sean Welleck

Ilia Kulikov

Zaid Harchaoui

374

110

24 Jun 2024

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

Se-Young Yun

178

24 Jun 2024

Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

...

683

21 Jun 2024

LiveMind: Low-latency Large Language Models with Simultaneous Inference

322

20 Jun 2024

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving

Zhi Wang

Jianguo Li

180

19 Jun 2024

Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

Bowen Zhou

355

18 Jun 2024

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Bhaskar Ramasubramanian

Radha Poovendran

SILM AAML

515

18 Jun 2024

Promises, Outlooks and Challenges of Diffusion Language Modeling

Justin Deschenaux

Çağlar Gülçehre

DiffM

310

17 Jun 2024

On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion

Wei Wei

Xiaoye Qu

335

17 Jun 2024

Optimized Speculative Sampling for GPU Hardware Accelerators

Seanie Lee

216

16 Jun 2024

New Solutions on LLM Acceleration, Optimization, and Application

Deming Chen

287

16 Jun 2024

OpenVLA: An Open-Source Vision-Language-Action Model

...

Dorsa Sadigh

Percy Liang

Chelsea Finn

LM&Ro VLM

607

1,379

13 Jun 2024

Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

Qian Liu

292

121

13 Jun 2024