ResearchTrend.AI
Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2023
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
    LRM
ArXiv (abs) · PDF · HTML · HuggingFace (9 upvotes)
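For readers browsing this citation list, the cited paper's core technique, speculative decoding, can be sketched in a few lines: a cheap draft model proposes several tokens, and the target model verifies them with accept probability min(1, p/q), resampling from the renormalized residual max(0, p − q) on rejection. The sketch below is a minimal toy illustration, not the authors' implementation: the fixed categorical distributions `draft_dist` and `target_dist` and the helper names are assumptions for demonstration only.

```python
import random

random.seed(0)

VOCAB = 4  # toy vocabulary size

def draft_dist(prefix):
    # Hypothetical small "draft" model: a fixed categorical distribution.
    return [0.4, 0.3, 0.2, 0.1]

def target_dist(prefix):
    # Hypothetical large "target" model: a different fixed distribution.
    return [0.25, 0.25, 0.25, 0.25]

def sample(dist):
    # Draw one token index from a categorical distribution.
    r, acc = random.random(), 0.0
    for tok, p in enumerate(dist):
        acc += p
        if r < acc:
            return tok
    return len(dist) - 1

def speculative_step(prefix, k=4):
    """One speculative-decoding step: draft k tokens, then verify them
    against the target model. Returns the accepted tokens plus one token
    sampled from the target (the corrected token at the first rejection,
    or a bonus token when all k drafts are accepted)."""
    drafts, ctx = [], list(prefix)
    for _ in range(k):
        t = sample(draft_dist(ctx))
        drafts.append(t)
        ctx.append(t)

    accepted, ctx = [], list(prefix)
    for t in drafts:
        p = target_dist(ctx)[t]   # target probability of the drafted token
        q = draft_dist(ctx)[t]    # draft probability of the same token
        if random.random() < min(1.0, p / q):
            accepted.append(t)    # accept: keep the draft token
            ctx.append(t)
        else:
            # Reject: resample from the residual max(0, p - q), renormalized.
            pt, qt = target_dist(ctx), draft_dist(ctx)
            residual = [max(0.0, a - b) for a, b in zip(pt, qt)]
            z = sum(residual)
            return accepted + [sample([r / z for r in residual])]
    # All k drafts accepted: sample one bonus token from the target.
    return accepted + [sample(target_dist(ctx))]

print(speculative_step([0], k=4))
```

This acceptance/residual scheme is what makes the method lossless: the emitted tokens are distributed exactly as if they had been sampled from the target model alone.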

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown
Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
Tyler Griggs
Xiaoxuan Liu
Jiaxiang Yu
Doyoung Kim
Wei-Lin Chiang
Alvin Cheung
Ion Stoica
336
27
0
22 Apr 2024
SnapKV: LLM Knows What You are Looking for Before Generation
Yuhong Li
Yingbing Huang
Bowen Yang
Bharat Venkitesh
Acyr Locatelli
Hanchen Ye
Tianle Cai
Patrick Lewis
Deming Chen
VLM
413
383
0
22 Apr 2024
A Survey on Efficient Inference for Large Language Models
Zixuan Zhou
Xuefei Ning
Ke Hong
Tianyu Fu
Jiaming Xu
...
Shengen Yan
Guohao Dai
Xiao-Ping Zhang
Yuhan Dong
Yu Wang
420
174
0
22 Apr 2024
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration
Pengfei Wu
Jiahao Liu
Zhuocheng Gong
Qifan Wang
Jinpeng Li
Jingang Wang
Xunliang Cai
Dongyan Zhao
188
3
0
18 Apr 2024
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hanshi Sun
Zhuoming Chen
Xinyu Yang
Yuandong Tian
Beidi Chen
369
84
0
18 Apr 2024
Language Model Cascades: Token-level uncertainty and beyond
Neha Gupta
Harikrishna Narasimhan
Wittawat Jitkrittum
A. S. Rawat
A. Menon
Sanjiv Kumar
UQLM
464
90
0
15 Apr 2024
Improving Recall of Large Language Models: A Model Collaboration Approach for Relational Triple Extraction
Zepeng Ding
Wenhao Huang
Jiaqing Liang
Deqing Yang
Yanghua Xiao
KELM
210
13
0
15 Apr 2024
Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models
Siyan Zhao
Daniel Israel
Karen Ullrich
Aditya Grover
KELM VLM
186
9
0
15 Apr 2024
Exploring and Improving Drafts in Blockwise Parallel Decoding
Taehyeon Kim
A. Suresh
Kishore Papineni
Michael Riley
Sanjiv Kumar
Adrian Benton
AI4TS
281
4
0
14 Apr 2024
On Speculative Decoding for Multimodal Large Language Models
Mukul Gagrani
Raghavv Goel
Wonseok Jeon
Junyoung Park
Mingu Lee
Christopher Lott
LRM
172
21
0
13 Apr 2024
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Haoran Qiu
Weichao Mao
Archit Patke
Shengkun Cui
Saurabh Jha
Chen Wang
Hubertus Franke
Zbigniew T. Kalbarczyk
Tamer Basar
Ravishankar K. Iyer
227
49
0
12 Apr 2024
Reducing hallucination in structured outputs via Retrieval-Augmented Generation
Patrice Béchard
Orlando Marquez Ayala
LLMAG
245
119
0
12 Apr 2024
Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Jie Ou
Yueming Chen
Wenhong Tian
302
23
0
10 Apr 2024
CQIL: Inference Latency Optimization with Concurrent Computation of Quasi-Independent Layers
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Longwei Zou
Qingyang Wang
Han Zhao
Tingfeng Liu
Yi Yang
Yangdong Deng
248
1
0
10 Apr 2024
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
Bowen Pan
Songlin Yang
Haokun Liu
Mayank Mishra
Gaoyuan Zhang
Aude Oliva
Colin Raffel
Yikang Shen
MoE
262
32
0
08 Apr 2024
Training LLMs over Neurally Compressed Text
Brian Lester
Jaehoon Lee
A. Alemi
Jeffrey Pennington
Adam Roberts
Jascha Narain Sohl-Dickstein
Noah Constant
207
11
0
04 Apr 2024
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
Michael Hassid
Tal Remez
Jonas Gehring
Roy Schwartz
Yossi Adi
272
40
0
31 Mar 2024
SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens
Chengbo Liu
Yong Zhu
124
2
0
27 Mar 2024
The Unreasonable Ineffectiveness of the Deeper Layers
Andrey Gromov
Kushal Tirumala
Hassan Shapourian
Paolo Glorioso
Daniel A. Roberts
428
158
0
26 Mar 2024
Multi-Level Explanations for Generative Language Models
Lucas Monteiro Paes
Dennis L. Wei
Hyo Jin Do
Hendrik Strobelt
Ronny Luss
...
Manish Nagireddy
Karthikeyan N. Ramamurthy
P. Sattigeri
Werner Geyer
Soumya Ghosh
FAtt
316
14
0
21 Mar 2024
Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks
Bo-Ru Lu
Nikita Haduong
Chien-Yu Lin
Hao Cheng
Noah A. Smith
Mari Ostendorf
AI4CE
209
1
0
19 Mar 2024
Toward Sustainable GenAI using Generation Directives for Carbon-Friendly Large Language Model Inference
Baolin Li
Yankai Jiang
V. Gadepally
Devesh Tiwari
244
23
0
19 Mar 2024
MELTing point: Mobile Evaluation of Language Transformers
Stefanos Laskaridis
Kleomenis Katevas
Lorenzo Minto
Hamed Haddadi
301
34
0
19 Mar 2024
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
Aonan Zhang
Chong-Jun Wang
Yi Wang
Xuanyu Zhang
Yunfei Cheng
291
26
0
14 Mar 2024
Token Alignment via Character Matching for Subword Completion
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Ben Athiwaratkun
Shiqi Wang
Mingyue Shang
Yuchen Tian
Zijian Wang
Sujan Kumar Gonugondla
Sanjay Krishna Gouda
Rob Kwiatowski
Ramesh Nallapati
Bing Xiang
192
9
0
13 Mar 2024
Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
Ben Athiwaratkun
Sujan Kumar Gonugondla
Sanjay Krishna Gouda
Haifeng Qian
Hantian Ding
...
Liangfu Chen
Parminder Bhatia
Ramesh Nallapati
Sudipta Sengupta
Bing Xiang
267
5
0
13 Mar 2024
CHAI: Clustered Head Attention for Efficient LLM Inference
International Conference on Machine Learning (ICML), 2024
Saurabh Agarwal
Bilge Acun
Basil Homer
Mostafa Elhoushi
Yejin Lee
Shivaram Venkataraman
Dimitris Papailiopoulos
Carole-Jean Wu
257
12
0
12 Mar 2024
Rethinking Generative Large Language Model Evaluation for Semantic Comprehension
International Conference on Machine Learning (ICML), 2024
Fangyun Wei
Xi Chen
Linzi Luo
ELM ALM LRM
201
14
0
12 Mar 2024
Learning to Decode Collaboratively with Multiple Language Models
Zejiang Shen
Hunter Lang
Bailin Wang
Yoon Kim
David Sontag
160
52
0
06 Mar 2024
Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People
Xidong Wang
Nuo Chen
Junying Chen
Yan Hu
Yidong Wang
Xiangbo Wu
Anningzhe Gao
Xiang Wan
Haizhou Li
Benyou Wang
LM&MA
298
42
0
06 Mar 2024
CoGenesis: A Framework Collaborating Large and Small Language Models for Secure Context-Aware Instruction Following
Kaiyan Zhang
Jianyu Wang
Ermo Hua
Biqing Qi
Ning Ding
Bowen Zhou
SyDa
357
35
0
05 Mar 2024
SynCode: LLM Generation with Grammar Augmentation
Shubham Ugare
Tarun Suresh
Hangoo Kang
Sasa Misailovic
Gagandeep Singh
294
37
0
03 Mar 2024
Accelerating Greedy Coordinate Gradient via Probe Sampling
Yiran Zhao
Wenyue Zheng
Tianle Cai
Xuan Long Do
Kenji Kawaguchi
Anirudh Goyal
Michael Shieh
322
2
0
02 Mar 2024
IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
Ruikang Liu
Haoli Bai
Haokun Lin
Yuening Li
Han Gao
Zheng-Jun Xu
Lu Hou
Jun Yao
Chun Yuan
MQ
254
44
0
02 Mar 2024
Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
Raghavv Goel
Mukul Gagrani
Wonseok Jeon
Junyoung Park
Mingu Lee
Christopher Lott
ALM
276
11
0
29 Feb 2024
Retrieval-Augmented Generation for AI-Generated Content: A Survey
Penghao Zhao
Hailin Zhang
Qinhan Yu
Zhengren Wang
Yunteng Geng
Fangcheng Fu
Ling Yang
Wentao Zhang
Jie Jiang
Tengjiao Wang
3DV
964
463
0
29 Feb 2024
CLLMs: Consistency Large Language Models
Siqi Kou
Lanxiang Hu
Zhe He
Zhijie Deng
Hao Zhang
464
54
0
28 Feb 2024
On the Challenges and Opportunities in Generative AI
Laura Manduchi
Kushagra Pandey
Robert Bamler
Sina Daubener
...
Yixin Wang
F. Wenzel
Frank Wood
Stephan Mandt
Vincent Fortuin
761
40
0
28 Feb 2024
Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding
Benjamin Bergner
Andrii Skliar
Amelie Royer
Tijmen Blankevoort
Yuki Markus Asano
B. Bejnordi
298
14
0
26 Feb 2024
Investigating the Effectiveness of HyperTuning via Gisting
Jason Phang
296
2
0
26 Feb 2024
LLM Inference Unveiled: Survey and Roofline Model Insights
Zhihang Yuan
Yuzhang Shang
Yang Zhou
Zhen Dong
Zhe Zhou
...
Yong Jae Lee
Yan Yan
Beidi Chen
Guangyu Sun
Kurt Keutzer
629
149
0
26 Feb 2024
Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens
Jiahong Yu
Qianshi Pang
Zihao Wang
Huiping Zhuang
Cen Chen
Xiaofeng Zou
252
6
0
24 Feb 2024
RelayAttention for Efficient Large Language Model Serving with Long System Prompts
Lei Zhu
Xinjiang Wang
Wayne Zhang
Rynson W. H. Lau
230
14
0
22 Feb 2024
T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching
Zizheng Pan
Bohan Zhuang
De-An Huang
Weili Nie
Zhiding Yu
Chaowei Xiao
Jianfei Cai
A. Anandkumar
232
25
0
21 Feb 2024
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Weilin Zhao
Yuxiang Huang
Xu Han
Wang Xu
Chaojun Xiao
Xinrong Zhang
Yewei Fang
Kaihuo Zhang
Zhiyuan Liu
Maosong Sun
276
23
0
21 Feb 2024
ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models
Chenyang Song
Xu Han
Zhengyan Zhang
Shengding Hu
Xiyu Shi
...
Chen Chen
Zhiyuan Liu
Guanglin Li
Tao Yang
Maosong Sun
376
41
0
21 Feb 2024
ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
Shuzhang Zhong
Zebin Yang
Meng Li
Ruihao Gong
Runsheng Wang
Ru Huang
217
13
0
21 Feb 2024
Purifying Large Language Models by Ensembling a Small Language Model
Tianlin Li
Qian Liu
Tianyu Pang
Chao Du
Qing Guo
Yang Liu
Min Lin
218
26
0
19 Feb 2024
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Marco Gaido
Sara Papi
Matteo Negri
L. Bentivogli
469
26
0
19 Feb 2024
Revisiting Knowledge Distillation for Autoregressive Language Models
Qihuang Zhong
Liang Ding
Li Shen
Juhua Liu
Bo Du
Dacheng Tao
KELM
309
27
0
19 Feb 2024