Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2023
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
    LRM
ArXiv (abs) · PDF · HTML · HuggingFace (9 upvotes)
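
Context for the citation list that follows: the paper above proposes speculative decoding, where a cheap draft model proposes several tokens, the large target model verifies them in one parallel pass, and a rejection rule preserves the target model's output distribution exactly. The Python sketch below is only a minimal illustration of that acceptance rule under simplifying assumptions; target_probs, draft_probs, and gamma are hypothetical stand-ins for real model calls, not the authors' implementation.

import numpy as np

def speculative_decode_step(prefix, target_probs, draft_probs, gamma=4, rng=None):
    """One speculative-decoding step (illustrative sketch, not the paper's code).

    target_probs(tokens) -> next-token distribution from the large target model
    draft_probs(tokens)  -> next-token distribution from the cheap draft model
    Both callables are hypothetical placeholders returning 1-D NumPy arrays
    that sum to 1 over the vocabulary.
    """
    rng = rng or np.random.default_rng()

    # 1) Draft gamma candidate tokens autoregressively with the small model.
    seq, drafted, q_dists = list(prefix), [], []
    for _ in range(gamma):
        q = draft_probs(seq)
        x = int(rng.choice(len(q), p=q))
        drafted.append(x)
        q_dists.append(q)
        seq.append(x)

    # 2) Score prefix + drafts with the target model. A real implementation
    #    would get all gamma + 1 distributions from one parallel forward pass.
    p_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3) Accept drafted token x with probability min(1, p(x) / q(x)); on the
    #    first rejection, resample from the residual max(0, p - q) and stop.
    #    This keeps the output distribution identical to the target model's.
    accepted = []
    for i, x in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted

    # 4) Every draft was accepted: take one extra token from the target model.
    accepted.append(int(rng.choice(len(p_dists[-1]), p=p_dists[-1])))
    return accepted

When the draft and target distributions agree often, each verification pass yields several accepted tokens in expectation, which is where the reported speedups come from.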

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown
Hot PATE: Private Aggregation of Distributions for Diverse Tasks
Edith Cohen
Benjamin Cohen-Wang
Xin Lyu
Jelani Nelson
Tamas Sarlos
Uri Stemmer
523
4
0
04 Dec 2023
TextGenSHAP: Scalable Post-hoc Explanations in Text Generation with Long Documents
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
James Enouen
Hootan Nakhost
Sayna Ebrahimi
Sercan O. Arik
Yan Liu
Tomas Pfister
337
14
0
03 Dec 2023
ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?
Hailin Chen
Fangkai Jiao
Xingxuan Li
Chengwei Qin
Mathieu Ravaut
Ruochen Zhao
Caiming Xiong
Shafiq Joty
ELM CLL AI4MH LRM ALM
361
31
0
28 Nov 2023
PaSS: Parallel Speculative Sampling
Giovanni Monea
Armand Joulin
Edouard Grave
MoE
219
45
0
22 Nov 2023
HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
Youhe Jiang
Ran Yan
Xiaozhe Yao
Yang Zhou
Beidi Chen
Binhang Yuan
SyDa
224
32
0
20 Nov 2023
Speculative Contrastive Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Hongyi Yuan
Keming Lu
Fei Huang
Zheng Yuan
Chang Zhou
165
8
0
15 Nov 2023
Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
Hongxuan Zhang
Zhining Liu
Yao Zhao
Jiaqi Zheng
Chenyi Zhuang
Jinjie Gu
Guihai Chen
LRM MLLM
217
2
0
14 Nov 2023
REST: Retrieval-Based Speculative Decoding
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Zhenyu He
Zexuan Zhong
Tianle Cai
Jason D. Lee
Di He
RALM
294
121
0
14 Nov 2023
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Zihao Wang
Shaofei Cai
Hoang Trung-Dung
Yonggang Jin
Jinbing Hou
...
Zhaofeng He
Zilong Zheng
Yaodong Yang
Xiaojian Ma
Yitao Liang
LLMAG LM&Ro
373
156
0
10 Nov 2023
Improving Machine Translation with Large Language Models: A Preliminary Study with Cooperative Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Jiali Zeng
Fandong Meng
Yongjing Yin
Jie Zhou
278
14
0
06 Nov 2023
GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values
Farnoosh Javadi
Walid Ahmed
Habib Hajimolahoseini
Foozhan Ataiefard
Mohammad Hassanpour
Saina Asani
Austin Wen
Omar Mohamed Awad
Kangling Liu
Yang Liu
VLM
303
8
0
06 Nov 2023
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Bjorn Deiseroth
Max Meuer
Nikolas Gritsch
C. Eichenberg
P. Schramowski
Matthias Aßenmacher
Kristian Kersting
66
3
0
02 Nov 2023
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
Sanchit Gandhi
Patrick von Platen
Alexander M. Rush
VLM
340
104
0
01 Nov 2023
The Synergy of Speculative Decoding and Batching in Serving Large Language Models
Qidong Su
Christina Giannoula
Gennady Pekhimenko
169
18
0
28 Oct 2023
Punica: Multi-Tenant LoRA Serving
Conference on Machine Learning and Systems (MLSys), 2023
Lequn Chen
Zihao Ye
Yongji Wu
Danyang Zhuo
Luis Ceze
Arvind Krishnamurthy
218
62
0
28 Oct 2023
Controlled Decoding from Language Models
International Conference on Machine Learning (ICML), 2023
Sidharth Mudgal
Jong Lee
H. Ganapathy
Yaguang Li
Tao Wang
...
Michael Collins
Trevor Strohman
Jilin Chen
Alex Beutel
Ahmad Beirami
463
113
0
25 Oct 2023
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Elias Frantar
Dan Alistarh
MQ MoE
260
37
0
25 Oct 2023
SpecTr: Fast Speculative Decoding via Optimal Transport
Neural Information Processing Systems (NeurIPS), 2023
Ziteng Sun
A. Suresh
Jae Hun Ro
Ahmad Beirami
Himanshu Jain
Felix X. Yu
329
117
0
23 Oct 2023
Large Search Model: Redefining Search Stack in the Era of LLMs
Liang Wang
Nan Yang
Xiaolong Huang
Linjun Yang
Rangan Majumder
Furu Wei
LRM KELM
227
25
0
23 Oct 2023
An Emulator for Fine-Tuning Large Language Models using Small Language Models
Eric Mitchell
Rafael Rafailov
Archit Sharma
Chelsea Finn
Christopher D. Manning
ALM
303
65
0
19 Oct 2023
SPEED: Speculative Pipelined Execution for Efficient Decoding
Coleman Hooper
Sehoon Kim
Hiva Mohammadzadeh
Hasan Genç
Kurt Keutzer
A. Gholami
Y. Shao
203
48
0
18 Oct 2023
BitNet: Scaling 1-bit Transformers for Large Language Models
Hongyu Wang
Shuming Ma
Li Dong
Shaohan Huang
Huaijie Wang
Lingxiao Ma
Fan Yang
Ruiping Wang
Yi Wu
Furu Wei
MQ
223
185
0
17 Oct 2023
Enhanced Transformer Architecture for Natural Language Processing
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2023
Woohyeon Moon
Taeyoung Kim
Bumgeun Park
Dongsoo Har
226
0
0
17 Oct 2023
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
Saleh Ashkboos
Ilia Markov
Elias Frantar
Tingxuan Zhong
Xincheng Wang
Jie Ren
Torsten Hoefler
Dan Alistarh
MQ SyDa
357
35
0
13 Oct 2023
Tree-Planner: Efficient Close-loop Task Planning with Large Language Models
International Conference on Learning Representations (ICLR), 2023
Mengkang Hu
Yao Mu
Xinmiao Yu
Mingyu Ding
Shiguang Wu
Wenqi Shao
Qiguang Chen
Bin Wang
Yu Qiao
Ping Luo
LLMAG
226
51
0
12 Oct 2023
DistillSpec: Improving Speculative Decoding via Knowledge Distillation
International Conference on Learning Representations (ICLR), 2023
Yongchao Zhou
Kaifeng Lyu
A. S. Rawat
A. Menon
Afshin Rostamizadeh
Sanjiv Kumar
Jean-François Kagy
Rishabh Agarwal
266
123
0
12 Oct 2023
MatFormer: Nested Transformer for Elastic Inference
Neural Information Processing Systems (NeurIPS), 2023
Devvrit
Sneha Kudugunta
Aditya Kusupati
Tim Dettmers
Kaifeng Chen
...
Yulia Tsvetkov
Hannaneh Hajishirzi
Sham Kakade
Ali Farhadi
Prateek Jain
255
61
0
11 Oct 2023
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), 2023
Yuhan Liu
Hanchen Li
Yihua Cheng
Siddhant Ray
Yuyang Huang
...
Ganesh Ananthanarayanan
Michael Maire
Henry Hoffmann
Ari Holtzman
Junchen Jiang
566
141
0
11 Oct 2023
Online Speculative Decoding
International Conference on Machine Learning (ICML), 2023
Xiaoxuan Liu
Lanxiang Hu
Peter Bailis
Alvin Cheung
Zhijie Deng
Ion Stoica
Hao Zhang
393
84
0
11 Oct 2023
CoQuest: Exploring Research Question Co-Creation with an LLM-based Agent
International Conference on Human Factors in Computing Systems (CHI), 2023
Yiren Liu
Si Chen
Haocong Cheng
Mengxia Yu
Xiao Ran
Andrew Mo
Yiliu Tang
Yun Huang
LLMAG
336
75
0
09 Oct 2023
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Sangmin Bae
Jongwoo Ko
Hwanjun Song
SeYoung Yun
270
78
0
09 Oct 2023
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
International Conference on Learning Representations (ICLR), 2023
Iman Mirzadeh
Keivan Alizadeh-Vahid
Sachin Mehta
C. C. D. Mundo
Oncel Tuzel
Golnoosh Samei
Mohammad Rastegari
Mehrdad Farajtabar
490
100
0
06 Oct 2023
DirectGPT: A Direct Manipulation Interface to Interact with Large Language Models
International Conference on Human Factors in Computing Systems (CHI), 2023
Damien Masson
Sylvain Malacria
Géry Casiez
Daniel Vogel
AI4CE KELM MLLM
255
69
0
05 Oct 2023
Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning
International Conference on Learning Representations (ICLR), 2023
Murong Yue
Jie Zhao
Min Zhang
Liang Du
Ziyu Yao
LRM
351
118
0
04 Oct 2023
Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training
International Conference on Machine Learning (ICML), 2023
Xidong Feng
Bo Liu
Muning Wen
Alexander Shmakov
Ying Wen
Weinan Zhang
Jun Wang
LRM AI4CE
261
286
0
29 Sep 2023
Pushing Large Language Models to the 6G Edge: Vision, Challenges, and Opportunities
IEEE Communications Magazine (IEEE Commun. Mag.), 2023
Zhengyi Lin
Guanqiao Qu
Qiyuan Chen
Randy Sarayar
Zhe Chen
Kaibin Huang
493
150
0
28 Sep 2023
Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Zheng Chu
Jingchang Chen
Qianglong Chen
Weijiang Yu
Tao He
Haotian Wang
Weihua Peng
Ming-Yuan Liu
Bing Qin
Ting Liu
LRM AI4CE
493
222
0
27 Sep 2023
LMDX: Language Model-based Document Information Extraction and Localization
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Vincent Perot
Kai Kang
Florian Luisier
Guolong Su
Xiaoyu Sun
...
Zifeng Wang
Jiaqi Mu
Hao Zhang
Chen-Yu Lee
Nan Hua
228
52
0
19 Sep 2023
LLMCad: Fast and Scalable On-device Large Language Model Inference
Daliang Xu
Wangsong Yin
Xin Jin
Yanzhe Zhang
Shiyun Wei
Mengwei Xu
Xuanzhe Liu
207
70
0
08 Sep 2023
SortedNet: A Scalable and Generalized Framework for Training Modular Deep Neural Networks
Mojtaba Valipour
Mehdi Rezagholizadeh
Hossein Rajabzadeh
Parsa Kavehzadeh
Marzieh S. Tahaei
Boxing Chen
Ali Ghodsi
133
2
0
01 Sep 2023
Uncertainty Estimation of Transformers' Predictions via Topological Analysis of the Attention Matrices
Elizaveta Kostenok
D. Cherniavskii
Alexey Zaytsev
249
9
0
22 Aug 2023
Accelerating LLM Inference with Staged Speculative Decoding
Benjamin Spector
Chris Ré
270
150
0
08 Aug 2023
RecycleGPT: An Autoregressive Language Model with Recyclable Module
Yu Jiang
Qiaozhi He
Xiaomin Zhuang
Zhihua Wu
Kunpeng Wang
Wenlai Zhao
Guangwen Yang
KELM
275
3
0
07 Aug 2023
Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding
Seongjun Yang
Gibbeum Lee
Jaewoong Cho
Dimitris Papailiopoulos
Kangwook Lee
224
46
0
12 Jul 2023
Query Understanding in the Age of Large Language Models
Avishek Anand
Venktesh V
Abhijit Anand
Vinay Setty
LRM
259
9
0
28 Jun 2023
LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Shizhe Diao
Boyao Wang
Hanze Dong
Kashun Shum
Jipeng Zhang
Wei Xiong
Tong Zhang
ALM
297
76
0
21 Jun 2023
GLIMMER: generalized late-interaction memory reranker
Michiel de Jong
Yury Zemlyanskiy
Nicholas FitzGerald
Sumit Sanghai
William W. Cohen
Joshua Ainslie
RALM
232
9
0
17 Jun 2023
On Optimal Caching and Model Multiplexing for Large Model Inference
Banghua Zhu
Ying Sheng
Lianmin Zheng
Clark W. Barrett
Sai Li
Jiantao Jiao
306
28
0
03 Jun 2023
Exploring the Practicality of Generative Retrieval on Dynamic Corpora
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Soyoung Yoon
Chaeeun Kim
Hyunji Lee
Joel Jang
Sohee Yang
Minjoon Seo
319
6
0
27 May 2023
Large Language Models as Tool Makers
International Conference on Learning Representations (ICLR), 2023
Tianle Cai
Xuezhi Wang
Tengyu Ma
Xinyun Chen
Denny Zhou
LLMAG
279
262
0
26 May 2023