Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2022

30 November 2022

Yaniv Leviathan

Matan Kalman

Yossi Matias

LRM

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown

MatMamba: A Matryoshka State Space Model

252

09 Oct 2024

SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

International Conference on Learning Representations (ICLR), 2024

Yongqi Li

Wenjie Li

334

09 Oct 2024

Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level

1.1K

09 Oct 2024

A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models

IEEE Circuits and Systems Magazine (IEEE CSM), 2024

...

Yiran Chen

226

08 Oct 2024

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Hongming Zhang

Siru Ouyang

251

08 Oct 2024

Efficient Inference for Large Language Model-based Generative Recommendation

International Conference on Learning Representations (ICLR), 2024

Xinyu Lin

Chaoqun Yang

Wenjie Wang

Yongqi Li

371

07 Oct 2024

Rational Metareasoning for Large Language Models

445

07 Oct 2024

RevMUX: Data Multiplexing with Reversible Adapters for Efficient LLM Batch Inference

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Yige Xu

Xu Guo

Zhiwei Zeng

Chunyan Miao

189

06 Oct 2024

Geometric Collaborative Filtering with Convergence

International Conference on Artificial Intelligence and Statistics (AISTATS), 2024

Hisham Husain

Julien Monteil

FedML

453

04 Oct 2024

Mixture of Attentions For Speculative Decoding

International Conference on Learning Representations (ICLR), 2024

Matthieu Zimmer

Milan Gritta

Gerasimos Lampouras

Haitham Bou Ammar

Jun Wang

342

04 Oct 2024

SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

346

04 Oct 2024

LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding

International Conference on Learning Representations (ICLR), 2024

Eunho Yang

473

04 Oct 2024

Efficiently Deploying LLMs with Controlled Risk

Michael J. Zellinger

Matt Thomson

281

03 Oct 2024

Better Instruction-Following Through Minimum Bayes Risk

International Conference on Learning Representations (ICLR), 2024

Graham Neubig

594

03 Oct 2024

Selective Attention Improves Transformer

International Conference on Learning Representations (ICLR), 2024

Yaniv Leviathan

Matan Kalman

Yossi Matias

359

03 Oct 2024

Inductive Generative Recommendation via Retrieval-based Speculation

138

03 Oct 2024

Interpretable Contrastive Monte Carlo Tree Search Reasoning

Aiwei Liu

Xuming Hu

Lijie Wen

LRM

479

02 Oct 2024

Integrative Decoding: Improve Factuality via Implicit Self-consistency

Yeyun Gong

...

Wenjie Li

Jian Jiao

Qi Chen

Peng Cheng

Wayne Xiong

HILM

509

02 Oct 2024

Speculative Coreset Selection for Task-Specific Fine-tuning

Xiaoyu Zhang

Chao Shen

Yang Liu

211

02 Oct 2024

Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding

International Conference on Learning Representations (ICLR), 2024

Yu Wang

Zhenguo Li

Xihui Liu

384

02 Oct 2024

Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Michael R. Metel

Peng Lu

Boxing Chen

Mehdi Rezagholizadeh

I. Kobyzev

170

01 Oct 2024

Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models

Mohammad Hossein Sekhavat

Moin Nabi

Mehrdad Farajtabar

MoE

279

01 Oct 2024

Approximately Aligned Decoding

Daniel Melcer

Sujan Kumar Gonugondla

301

01 Oct 2024

Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface

Yongfeng Zhang

195

30 Sep 2024

Characterizing and Efficiently Accelerating Multimodal Generation Model Inference

...

475

30 Sep 2024

The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems

IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024

617

30 Sep 2024

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Yizhou Sun

284

25 Sep 2024

Accumulator-Aware Post-Training Quantization for Large Language Models

277

25 Sep 2024

Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR

178

24 Sep 2024

Efficiently Dispatching Flash Attention For Partially Filled Attention Masks

Agniv Sharma

Jonas Geiping

217

23 Sep 2024

CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Junlin Lv

Yuan Feng

287

19 Sep 2024

Improving Multi-candidate Speculative Decoding

106

16 Sep 2024

Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance

Adarsh MS

Jithin VG

Ditto PS

118

15 Sep 2024

What is the Role of Small Models in the LLM Era: A Survey

Lihu Chen

Gaël Varoquaux

ALM

784

10 Sep 2024

Recall: Empowering Multimodal Embedding for Edge Devices

Dongqi Cai

Shangguang Wang

Chen Peng

Zeling Zhang

Mengwei Xu

180

09 Sep 2024

An overview of domain-specific foundation model: key technologies, applications and challenges

Science China Information Sciences (Sci. China Inf. Sci.), 2024

492

06 Sep 2024

CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Chun Jason Xue

02 Sep 2024

Dynamic Depth Decoding: Faster Speculative Decoding for LLMs

293

30 Aug 2024

Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling

International Conference on Learning Representations (ICLR), 2024

371

30 Aug 2024

Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation

Lujun Gui

Bin Xiao

Lei Su

Weipeng Chen

190

28 Aug 2024

Learning Harmonized Representations for Speculative Sampling

International Conference on Learning Representations (ICLR), 2024

314

28 Aug 2024

NanoFlow: Towards Optimal Large Language Model Serving Throughput

...

Chien-Yu Lin

229

22 Aug 2024

Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Chenhan Yuan

Fei Huang

Ru Peng

Keming Lu

Bowen Yu

Chang Zhou

Jingren Zhou

KELM

217

20 Aug 2024

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

International Conference on Learning Representations (ICLR), 2024

Avner May

Tianqi Chen

Beidi Chen

LRM

677

20 Aug 2024

Parallel Sampling via Counting

Symposium on the Theory of Computing (STOC), 2024

Nima Anari

Ruiquan Gao

Aviad Rubinstein

184

18 Aug 2024

Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Jerry Huang

Prasanna Parthasarathi

Mehdi Rezagholizadeh

Sarath Chandar

233

16 Aug 2024

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

454

16 Aug 2024

P/D-Serve: Serving Disaggregated Large Language Model at Scale

...

Haoliang Cheng

215

15 Aug 2024

KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning

International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2024

Kaiqi Zhang

Jing Zhao

Rui Chen

315

15 Aug 2024

Coupling without Communication and Drafter-Invariant Speculative Decoding

International Symposium on Information Theory (ISIT), 2024

Majid Daliri

Christopher Musco

A. Suresh

398

15 Aug 2024