Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2023
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
    LRM
ArXiv (abs) · PDF · HTML · HuggingFace (9 upvotes)
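
Context for the citation list that follows: the paper above proposes speculative decoding, where a cheap draft model proposes several tokens, the large target model verifies them in one parallel pass, and a rejection rule preserves the target model's output distribution exactly. The Python sketch below is only a minimal illustration of that acceptance rule under simplifying assumptions; target_probs, draft_probs, and gamma are hypothetical stand-ins for real model calls, not the authors' implementation.

import numpy as np

def speculative_decode_step(prefix, target_probs, draft_probs, gamma=4, rng=None):
    """One speculative-decoding step (illustrative sketch, not the paper's code).

    target_probs(tokens) -> next-token distribution from the large target model
    draft_probs(tokens)  -> next-token distribution from the cheap draft model
    Both callables are hypothetical placeholders returning 1-D NumPy arrays
    that sum to 1 over the vocabulary.
    """
    rng = rng or np.random.default_rng()

    # 1) Draft gamma candidate tokens autoregressively with the small model.
    seq, drafted, q_dists = list(prefix), [], []
    for _ in range(gamma):
        q = draft_probs(seq)
        x = int(rng.choice(len(q), p=q))
        drafted.append(x)
        q_dists.append(q)
        seq.append(x)

    # 2) Score prefix + drafts with the target model. A real implementation
    #    would get all gamma + 1 distributions from one parallel forward pass.
    p_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3) Accept drafted token x with probability min(1, p(x) / q(x)); on the
    #    first rejection, resample from the residual max(0, p - q) and stop.
    #    This keeps the output distribution identical to the target model's.
    accepted = []
    for i, x in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted

    # 4) Every draft was accepted: take one extra token from the target model.
    accepted.append(int(rng.choice(len(p_dists[-1]), p=p_dists[-1])))
    return accepted

When the draft and target distributions agree often, each verification pass yields several accepted tokens in expectation, which is where the reported speedups come from.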

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown
Hot PATE: Private Aggregation of Distributions for Diverse Tasks
Edith Cohen
Benjamin Cohen-Wang
Xin Lyu
Jelani Nelson
Tamas Sarlos
Uri Stemmer
523
4
0
04 Dec 2023
TextGenSHAP: Scalable Post-hoc Explanations in Text Generation with Long Documents
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
James Enouen
Hootan Nakhost
Sayna Ebrahimi
Sercan O. Arik
Yan Liu
Tomas Pfister
337
14
0
03 Dec 2023
ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?
Hailin Chen
Fangkai Jiao
Xingxuan Li
Chengwei Qin
Mathieu Ravaut
Ruochen Zhao
Caiming Xiong
Shafiq Joty
ELM CLL AI4MH LRM ALM
361
31
0
28 Nov 2023
PaSS: Parallel Speculative Sampling
Giovanni Monea
Armand Joulin
Edouard Grave
MoE
219
45
0
22 Nov 2023
HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
Youhe Jiang
Ran Yan
Xiaozhe Yao
Yang Zhou
Beidi Chen
Binhang Yuan
SyDa
224
32
0
20 Nov 2023
Speculative Contrastive Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Hongyi Yuan
Keming Lu
Fei Huang
Zheng Yuan
Chang Zhou
165
8
0
15 Nov 2023
Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
Hongxuan Zhang
Zhining Liu
Yao Zhao
Jiaqi Zheng
Chenyi Zhuang
Jinjie Gu
Guihai Chen
LRM MLLM
217
2
0
14 Nov 2023
REST: Retrieval-Based Speculative Decoding
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Zhenyu He
Zexuan Zhong
Tianle Cai
Jason D. Lee
Di He
RALM
294
121
0
14 Nov 2023
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Zihao Wang
Shaofei Cai
Hoang Trung-Dung
Yonggang Jin
Jinbing Hou
...
Zhaofeng He
Zilong Zheng
Yaodong Yang
Xiaojian Ma
Yitao Liang
LLMAG LM&Ro
373
156
0
10 Nov 2023
Improving Machine Translation with Large Language Models: A Preliminary Study with Cooperative Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Jiali Zeng
Fandong Meng
Yongjing Yin
Jie Zhou
278
14
0
06 Nov 2023
GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values
Farnoosh Javadi
Walid Ahmed
Habib Hajimolahoseini
Foozhan Ataiefard
Mohammad Hassanpour
Saina Asani
Austin Wen
Omar Mohamed Awad
Kangling Liu
Yang Liu
VLM
303
8
0
06 Nov 2023
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Bjorn Deiseroth
Max Meuer
Nikolas Gritsch
C. Eichenberg
P. Schramowski
Matthias Aßenmacher
Kristian Kersting
66
3
0
02 Nov 2023
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
Sanchit Gandhi
Patrick von Platen
Alexander M. Rush
VLM
340
104
0
01 Nov 2023
The Synergy of Speculative Decoding and Batching in Serving Large Language Models
Qidong Su
Christina Giannoula
Gennady Pekhimenko
169
18
0
28 Oct 2023
Punica: Multi-Tenant LoRA Serving
Conference on Machine Learning and Systems (MLSys), 2023
Lequn Chen
Zihao Ye
Yongji Wu
Danyang Zhuo
Luis Ceze
Arvind Krishnamurthy
218
62
0
28 Oct 2023
Controlled Decoding from Language Models
International Conference on Machine Learning (ICML), 2023
Sidharth Mudgal
Jong Lee
H. Ganapathy
Yaguang Li
Tao Wang
...
Michael Collins
Trevor Strohman
Jilin Chen
Alex Beutel
Ahmad Beirami
463
113
0
25 Oct 2023
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Elias Frantar
Dan Alistarh
MQ MoE
260
37
0
25 Oct 2023
SpecTr: Fast Speculative Decoding via Optimal Transport
Neural Information Processing Systems (NeurIPS), 2023
Ziteng Sun
A. Suresh
Jae Hun Ro
Ahmad Beirami
Himanshu Jain
Felix X. Yu
329
117
0
23 Oct 2023
Large Search Model: Redefining Search Stack in the Era of LLMs
Liang Wang
Nan Yang
Xiaolong Huang
Linjun Yang
Rangan Majumder
Furu Wei
LRM KELM
227
25
0
23 Oct 2023
An Emulator for Fine-Tuning Large Language Models using Small Language Models
Eric Mitchell
Rafael Rafailov
Archit Sharma
Chelsea Finn
Christopher D. Manning
ALM
303
65
0
19 Oct 2023
SPEED: Speculative Pipelined Execution for Efficient Decoding
Coleman Hooper
Sehoon Kim
Hiva Mohammadzadeh
Hasan Genç
Kurt Keutzer
A. Gholami
Y. Shao
203
48
0
18 Oct 2023
BitNet: Scaling 1-bit Transformers for Large Language Models
Hongyu Wang
Shuming Ma
Li Dong
Shaohan Huang
Huaijie Wang
Lingxiao Ma
Fan Yang
Ruiping Wang
Yi Wu
Furu Wei
MQ
223
185
0
17 Oct 2023
Enhanced Transformer Architecture for Natural Language Processing
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2023
Woohyeon Moon
Taeyoung Kim
Bumgeun Park
Dongsoo Har
226
0
0
17 Oct 2023
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
Saleh Ashkboos
Ilia Markov
Elias Frantar
Tingxuan Zhong
Xincheng Wang
Jie Ren
Torsten Hoefler
Dan Alistarh
MQ SyDa
357
35
0
13 Oct 2023
Tree-Planner: Efficient Close-loop Task Planning with Large Language Models
International Conference on Learning Representations (ICLR), 2023
Mengkang Hu
Yao Mu
Xinmiao Yu
Mingyu Ding
Shiguang Wu
Wenqi Shao
Qiguang Chen
Bin Wang
Yu Qiao
Ping Luo
LLMAG
226
51
0
12 Oct 2023
DistillSpec: Improving Speculative Decoding via Knowledge Distillation
International Conference on Learning Representations (ICLR), 2023
Yongchao Zhou
Kaifeng Lyu
A. S. Rawat
A. Menon
Afshin Rostamizadeh
Sanjiv Kumar
Jean-François Kagy
Rishabh Agarwal
266
123
0
12 Oct 2023
MatFormer: Nested Transformer for Elastic Inference
Neural Information Processing Systems (NeurIPS), 2023
Devvrit
Sneha Kudugunta
Aditya Kusupati
Tim Dettmers
Kaifeng Chen
...
Yulia Tsvetkov
Hannaneh Hajishirzi
Sham Kakade
Ali Farhadi
Prateek Jain
255
61
0
11 Oct 2023
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), 2023
Yuhan Liu
Hanchen Li
Yihua Cheng
Siddhant Ray
Yuyang Huang
...
Ganesh Ananthanarayanan
Michael Maire
Henry Hoffmann
Ari Holtzman
Junchen Jiang
566
141
0
11 Oct 2023
Online Speculative Decoding
International Conference on Machine Learning (ICML), 2023
Xiaoxuan Liu
Lanxiang Hu
Peter Bailis
Alvin Cheung
Zhijie Deng
Ion Stoica
Hao Zhang
393
84
0
11 Oct 2023
CoQuest: Exploring Research Question Co-Creation with an LLM-based Agent
International Conference on Human Factors in Computing Systems (CHI), 2023
Yiren Liu
Si Chen
Haocong Cheng
Mengxia Yu
Xiao Ran
Andrew Mo
Yiliu Tang
Yun Huang
LLMAG
336
75
0
09 Oct 2023
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Sangmin Bae
Jongwoo Ko
Hwanjun Song
SeYoung Yun
270
78
0
09 Oct 2023
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
International Conference on Learning Representations (ICLR), 2023
Iman Mirzadeh
Keivan Alizadeh-Vahid
Sachin Mehta
C. C. D. Mundo
Oncel Tuzel
Golnoosh Samei
Mohammad Rastegari
Mehrdad Farajtabar
490
100
0
06 Oct 2023
DirectGPT: A Direct Manipulation Interface to Interact with Large Language Models
International Conference on Human Factors in Computing Systems (CHI), 2023
Damien Masson
Sylvain Malacria
Géry Casiez
Daniel Vogel
AI4CE KELM MLLM
255
69
0
05 Oct 2023
Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning
International Conference on Learning Representations (ICLR), 2023
Murong Yue
Jie Zhao
Min Zhang
Liang Du
Ziyu Yao
LRM
351
118
0
04 Oct 2023
Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training
International Conference on Machine Learning (ICML), 2023
Xidong Feng
Bo Liu
Muning Wen
Alexander Shmakov
Ying Wen
Weinan Zhang
Jun Wang
LRM AI4CE
261
286
0
29 Sep 2023
Pushing Large Language Models to the 6G Edge: Vision, Challenges, and Opportunities
IEEE Communications Magazine (IEEE Commun. Mag.), 2023
Zhengyi Lin
Guanqiao Qu
Qiyuan Chen
Randy Sarayar
Zhe Chen
Kaibin Huang
493
150
0
28 Sep 2023
Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Zheng Chu
Jingchang Chen
Qianglong Chen
Weijiang Yu
Tao He
Haotian Wang
Weihua Peng
Ming-Yuan Liu
Bing Qin
Ting Liu
LRM AI4CE
493
222
0
27 Sep 2023
LMDX: Language Model-based Document Information Extraction and Localization
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Vincent Perot
Kai Kang
Florian Luisier
Guolong Su
Xiaoyu Sun
...
Zifeng Wang
Jiaqi Mu
Hao Zhang
Chen-Yu Lee
Nan Hua
228
52
0
19 Sep 2023
LLMCad: Fast and Scalable On-device Large Language Model Inference
Daliang Xu
Wangsong Yin
Xin Jin
Yanzhe Zhang
Shiyun Wei
Mengwei Xu
Xuanzhe Liu
207
70
0
08 Sep 2023
SortedNet: A Scalable and Generalized Framework for Training Modular Deep Neural Networks
Mojtaba Valipour
Mehdi Rezagholizadeh
Hossein Rajabzadeh
Parsa Kavehzadeh
Marzieh S. Tahaei
Boxing Chen
Ali Ghodsi
133
2
0
01 Sep 2023
Uncertainty Estimation of Transformers' Predictions via Topological Analysis of the Attention Matrices
Elizaveta Kostenok
D. Cherniavskii
Alexey Zaytsev
249
9
0
22 Aug 2023
Accelerating LLM Inference with Staged Speculative Decoding
Benjamin Spector
Chris Ré
270
150
0
08 Aug 2023
RecycleGPT: An Autoregressive Language Model with Recyclable Module
Yu Jiang
Qiaozhi He
Xiaomin Zhuang
Zhihua Wu
Kunpeng Wang
Wenlai Zhao
Guangwen Yang
KELM
275
3
0
07 Aug 2023
Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding
Seongjun Yang
Gibbeum Lee
Jaewoong Cho
Dimitris Papailiopoulos
Kangwook Lee
224
46
0
12 Jul 2023
Query Understanding in the Age of Large Language Models
Avishek Anand
Venktesh V
Abhijit Anand
Vinay Setty
LRM
259
9
0
28 Jun 2023
LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Shizhe Diao
Boyao Wang
Hanze Dong
Kashun Shum
Jipeng Zhang
Wei Xiong
Tong Zhang
ALM
297
76
0
21 Jun 2023
GLIMMER: generalized late-interaction memory reranker
Michiel de Jong
Yury Zemlyanskiy
Nicholas FitzGerald
Sumit Sanghai
William W. Cohen
Joshua Ainslie
RALM
232
9
0
17 Jun 2023
On Optimal Caching and Model Multiplexing for Large Model Inference
Banghua Zhu
Ying Sheng
Lianmin Zheng
Clark W. Barrett
Sai Li
Jiantao Jiao
306
28
0
03 Jun 2023
Exploring the Practicality of Generative Retrieval on Dynamic Corpora
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Soyoung Yoon
Chaeeun Kim
Hyunji Lee
Joel Jang
Sohee Yang
Minjoon Seo
319
6
0
27 May 2023
Large Language Models as Tool Makers
International Conference on Learning Representations (ICLR), 2023
Tianle Cai
Xuezhi Wang
Tengyu Ma
Xinyun Chen
Denny Zhou
LLMAG
279
262
0
26 May 2023