ResearchTrend.AI

© 2026 ResearchTrend.AI, All rights reserved.

arXiv:2211.17192 · Cited By
Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2023
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
    LRM
arXiv (abs) · PDF · HTML · HuggingFace (9 upvotes)
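For context on the cited paper: its core loop has a cheap draft model propose γ tokens autoregressively, which the target model then verifies, keeping the longest agreeing prefix plus one token from the target itself. The sketch below illustrates only that control flow; both "models" are hypothetical deterministic toy functions (the real algorithm samples from model distributions and uses a probabilistic accept/reject rule, with all γ verifications computed in one parallel transformer pass):

```python
# Toy sketch of the speculative decoding loop from Leviathan et al.
# (arXiv:2211.17192). Deterministic stand-ins replace real models so the
# control flow is self-contained; this is NOT the paper's sampling rule.

def target_model(prefix):
    # Hypothetical stand-in for the large model: next token = digit sum mod 10.
    return sum(prefix) % 10

def draft_model(prefix):
    # Hypothetical cheap draft: agrees with the target except when the
    # prefix sum is divisible by 7 (a stand-in for occasional mismatch).
    t = sum(prefix) % 10
    return (t + 1) % 10 if sum(prefix) % 7 == 0 else t

def speculative_step(prefix, gamma=4):
    """One step: draft gamma tokens, verify with the target, keep the
    accepted prefix, then append one token from the target model."""
    # 1. Draft gamma tokens autoregressively with the cheap model.
    draft, ctx = [], list(prefix)
    for _ in range(gamma):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)
    # 2. Verify: accept draft tokens until the first disagreement.
    ctx = list(prefix)
    for tok in draft:
        if target_model(ctx) != tok:
            break
        ctx.append(tok)
    # 3. Always emit the target's own token at the mismatch position
    #    (or a bonus token after full acceptance), so every step makes
    #    progress and the output matches target-only greedy decoding.
    ctx.append(target_model(ctx))
    return ctx

def decode(prefix, steps):
    out = list(prefix)
    for _ in range(steps):
        out = speculative_step(out)
    return out
```

Because each accepted or appended token equals the target's greedy prediction at that position, the output is identical to decoding with `target_model` alone, just produced in fewer target-model invocations.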

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown
Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
Zijin Hong
Zheng Yuan
Qinggang Zhang
Hao Chen
Hao-Heng Chen
Feiran Huang
Xiao Huang
853
150
0
12 Jun 2024
OPTune: Efficient Online Preference Tuning
Lichang Chen
Jiuhai Chen
Chenxi Liu
John Kirchenbauer
Davit Soselia
Chen Zhu
Tom Goldstein
Wanrong Zhu
Heng Huang
130
7
0
11 Jun 2024
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
Haoran You
Yichao Fu
Zheng Wang
Amir Yazdanbakhsh
Yingyan Celine Lin
370
8
0
11 Jun 2024
Crayon: Customized On-Device LLM via Instant Adapter Blending and Edge-Server Hybrid Inference
Jihwan Bang
Juntae Lee
Kyuhong Shim
Seunghan Yang
Simyung Chang
190
10
0
11 Jun 2024
Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
Yixin Song
Haotong Xie
Zhengyan Zhang
Bo Wen
Li Ma
Zeyu Mi
Haibo Chen
MoE
395
35
0
10 Jun 2024
Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction
Ke Cheng
Wen Hu
Zhi Wang
Peng Du
Jianguo Li
Sheng Zhang
300
17
0
07 Jun 2024
Proofread: Fixes All Errors with One Tap
Renjie Liu
Yanxiang Zhang
Yun Zhu
Haicheng Sun
Yuanbo Zhang
Michael Xuelin Huang
Shanqing Cai
Lei Meng
Shumin Zhai
ALM
181
6
0
06 Jun 2024
To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Abdul Waheed
Karima Kadaoui
Muhammad Abdul-Mageed
VLM
219
6
0
06 Jun 2024
Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism
Jiahao Liu
Qifan Wang
Jingang Wang
Xunliang Cai
192
30
0
06 Jun 2024
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
Ruslan Svirschevski
Avner May
Zhuoming Chen
Beidi Chen
Zhihao Jia
Max Ryabinin
348
45
0
04 Jun 2024
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Namgyu Ho
Sangmin Bae
Taehyeon Kim
Hyunjik Jo
Yireun Kim
Tal Schuster
Adam Fisch
James Thorne
Se-Young Yun
315
29
0
04 Jun 2024
Diver: Large Language Model Decoding with Span-Level Mutual Information Verification
Jinliang Lu
Chen Wang
Jiajun Zhang
229
4
0
04 Jun 2024
OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Owen Dugan
Donato Manuel Jimenez Beneto
Charlotte Loh
Zhuo Chen
Rumen Dangovski
Marin Soljacic
LRM
341
4
0
04 Jun 2024
GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security
Xuanqing Liu
Luyang Kong
Runhui Wang
Patrick Song
Austin Nevins
Henrik Johnson
Nimish Amlathe
Davor Golac
172
5
0
04 Jun 2024
SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM
Quandong Wang
Yuxuan Yuan
Xiaoyu Yang
Ruike Zhang
Kang Zhao
Wei Liu
Jian Luan
Daniel Povey
Sijin Yu
427
0
0
03 Jun 2024
Achieving Sparse Activation in Small Language Models
Jifeng Song
Kai Huang
Xiangyu Yin
Boyuan Yang
Wei Gao
207
5
0
03 Jun 2024
Decentralized AI: Permissionless LLM Inference on POKT Network
D. Olshansky
Ramiro Rodríguez Colmeiro
Bowen Li
MoE
71
4
0
30 May 2024
S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs
Wei Zhong
Manasa Bharadwaj
366
9
0
30 May 2024
GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment
Yao Yao
Z. Li
Hai Zhao
141
15
0
30 May 2024
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Kaixuan Huang
Xudong Guo
M. Y. Wang
534
45
0
30 May 2024
Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution
Yechen Xu
Xinhao Kong
Ying Zhang
Danyang Zhuo
LLMAG
240
8
0
29 May 2024
Faster Cascades via Speculative Decoding
Harikrishna Narasimhan
Wittawat Jitkrittum
A. S. Rawat
Seungyeon Kim
Neha Gupta
A. Menon
Sanjiv Kumar
LRM
376
21
0
29 May 2024
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Minghan Li
Xilun Chen
Ari Holtzman
Beidi Chen
Jimmy Lin
Anuj Kumar
Xi Lin
RALM BDL
742
21
0
29 May 2024
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
Ethan Shen
Alan Fan
Sarah M Pratt
Jae Sung Park
Matthew Wallingford
Sham Kakade
Ari Holtzman
Ranjay Krishna
Ali Farhadi
Aditya Kusupati
365
4
0
28 May 2024
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
Hao Mark Chen
Wayne Luk
Ka-Fai Cedric Yiu
Rui Li
Konstantin Mishchenko
Stylianos I. Venieris
Hongxiang Fan
290
15
0
28 May 2024
Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection
Yun Zhu
Jia-Chen Gu
Caitlin Sikora
Ho Ko
Yinxiao Liu
...
Lei Shu
Liangchen Luo
Lei Meng
Bang Liu
Jindong Chen
RALM
254
27
0
25 May 2024
Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs
Chenxi Sun
Hongzhi Zhang
Zijia Lin
Jingyuan Zhang
Fuzheng Zhang
...
Bin Chen
Chengru Song
Chen Zhang
Kun Gai
Deyi Xiong
171
2
0
24 May 2024
A Declarative System for Optimizing AI Workloads
Chunwei Liu
Matthew Russo
Michael Cafarella
Lei Cao
Peter Baille Chen
Zui Chen
Michael Franklin
Tim Kraska
Samuel Madden
Gerardo Vitagliano
264
45
0
23 May 2024
Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs
Qingyuan Li
Ran Meng
Yiduo Li
Bo Zhang
Yifan Lu
Yerui Sun
Lin Ma
Yuchen Xie
MQ
238
0
0
23 May 2024
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference
International Conference on Learning Representations (ICLR), 2024
Nadav Timor
Jonathan Mamou
Daniel Korat
Moshe Berchansky
Oren Pereg
Moshe Wasserblat
Tomer Galanti
Michal Gordon
David Harel
LRM
266
1
0
23 May 2024
Modeling Real-Time Interactive Conversations as Timed Diarized Transcripts
Garrett Tanzer
Gustaf Ahdritz
Luke Melas-Kyriazi
345
1
0
21 May 2024
Towards Modular LLMs by Building and Reusing a Library of LoRAs
International Conference on Machine Learning (ICML), 2024
O. Ostapenko
Zhan Su
Edoardo Ponti
Laurent Charlin
Nicolas Le Roux
Matheus Pereira
Lucas Caccia
Alessandro Sordoni
MoMe
258
55
0
18 May 2024
A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models
Mahsa Khoshnoodi
Vinija Jain
Mingye Gao
Malavika Srikanth
Vasu Sharma
OffRL
350
9
0
15 May 2024
Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
Yao Fu
198
38
0
14 May 2024
EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models
Yunsheng Ni
Chuanjian Liu
Yehui Tang
Kai Han
Yunhe Wang
253
1
0
13 May 2024
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
R. Prabhakar
R. Sivaramakrishnan
Darshan Gandhi
Yun Du
Mingran Wang
...
Urmish Thakker
Dawei Huang
Sumti Jairath
Kevin J. Brown
K. Olukotun
MoE
238
27
0
13 May 2024
A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models
Knowledge Discovery and Data Mining (KDD), 2024
Wenqi Fan
Yujuan Ding
Liang-bo Ning
Shijie Wang
Hengyun Li
D. Yin
Tat-Seng Chua
Qing Li
RALM 3DV
625
630
0
10 May 2024
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
International Conference on Machine Learning (ICML), 2024
Minsik Cho
Mohammad Rastegari
Devang Naik
223
11
0
08 May 2024
Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models
Jonathan Mamou
Oren Pereg
Daniel Korat
Moshe Berchansky
Nadav Timor
Moshe Wasserblat
Roy Schwartz
197
11
0
07 May 2024
Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection
Guillem Ramírez
Alexandra Birch
Ivan Titov
327
19
0
03 May 2024
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Bin Xiao
Chunan Shi
Xiaonan Nie
Fan Yang
Xiangwei Deng
Lei Su
Weipeng Chen
Tengjiao Wang
270
11
0
01 May 2024
Better & Faster Large Language Models via Multi-token Prediction
Fabian Gloeckle
Badr Youbi Idrissi
Baptiste Rozière
David Lopez-Paz
Gabriele Synnaeve
313
222
0
30 Apr 2024
Accelerating Production LLMs with Combined Token/Embedding Speculators
Davis Wertheimer
Joshua Rosenkranz
Thomas Parnell
Sahil Suneja
Pavithra Ranganathan
R. Ganti
Mudhakar Srivatsa
349
6
0
29 Apr 2024
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
Fangcheng Liu
Yehui Tang
Zhenhua Liu
Yunsheng Ni
Kai Han
Yunhe Wang
305
42
0
29 Apr 2024
BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
Jiamin Li
Le Xu
Hong-Yu Xu
Aditya Akella
251
6
0
28 Apr 2024
Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM
Xuan Zhang
Wei Gao
LRM KELM
284
17
0
26 Apr 2024
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Mostafa Elhoushi
Akshat Shrivastava
Diana Liskovich
Basil Hosmer
Bram Wasti
...
Saurabh Agarwal
Ahmed Roman
Ahmed Aly
Beidi Chen
Carole-Jean Wu
LRM
261
190
0
25 Apr 2024
BASS: Batched Attention-optimized Speculative Sampling
Haifeng Qian
Sujan Kumar Gonugondla
Sungsoo Ha
Mingyue Shang
Sanjay Krishna Gouda
Ramesh Nallapati
Sudipta Sengupta
Xiaofei Ma
Hao Ding
BDL
288
14
0
24 Apr 2024
Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
Chen Zhang
Zhuorui Liu
Dawei Song
LRM
277
8
0
23 Apr 2024
Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
Dujian Ding
Ankur Mallick
Chi Wang
Robert Sim
Subhabrata Mukherjee
Victor Rühle
L. Lakshmanan
Ahmed Hassan Awadallah
410
202
0
22 Apr 2024
Page 12 of 16