Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2022
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
    LRM

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown
Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
Neural Information Processing Systems (NeurIPS), 2024
R. Prabhakar
Hengrui Zhang
D. Wentzlaff
294
1
0
14 Aug 2024
PEARL: Parallel Speculative Decoding with Adaptive Draft Length
International Conference on Learning Representations (ICLR), 2024
Tianyu Liu
Yun Li
Qitan Lv
Kai Liu
Jianchen Zhu
Winston Hu
Xingwu Sun
383
44
0
13 Aug 2024
Post-Training Sparse Attention with Double Sparsity
Shuo Yang
Ying Sheng
Joseph E. Gonzalez
Ion Stoica
Lianmin Zheng
296
25
0
11 Aug 2024
Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024
Yunjia Xi
Hangyu Wang
Bo Chen
Jianghao Lin
Menghui Zhu
Wen Liu
Ruiming Tang
Zhewei Wei
Weinan Zhang
Yong Yu
OffRL
401
6
0
11 Aug 2024
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Jacob K Christopher
Brian Bartoldson
Tal Ben-Nun
Michael Cardei
B. Kailkhura
Ferdinando Fioretto
DiffM
526
25
0
10 Aug 2024
Retrieval-augmented code completion for local projects using large language models
Expert Systems with Applications (ESWA), 2024
Marko Hostnik
Marko Robnik-Sikonja
RALM
275
3
0
09 Aug 2024
CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding
Sophia Ho
Jinsol Park
Patrick Wang
211
0
0
08 Aug 2024
StructuredRAG: JSON Response Formatting with Large Language Models
Connor Shorten
Charles Pierse
Thomas Benjamin Smith
Erika Cardenas
Akanksha Sharma
John Trengrove
Bob van Luijt
292
23
0
07 Aug 2024
Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations
Leo Donisch
Sigurd Schacht
Carsten Lanquillon
298
3
0
06 Aug 2024
Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding
Bin Xiao
Lujun Gui
Lei Su
Weipeng Chen
195
5
0
01 Aug 2024
ThinK: Thinner Key Cache by Query-Driven Pruning
International Conference on Learning Representations (ICLR), 2024
Yuhui Xu
Zhanming Jie
Hanze Dong
Lei Wang
Xudong Lu
Aojun Zhou
Amrita Saha
Caiming Xiong
Doyen Sahoo
533
41
0
30 Jul 2024
Inference acceleration for large language models using "stairs" assisted greedy generation
International Conference on Information Technology (ICIT), 2024
Domas Grigaliunas
M. Lukoševičius
118
0
0
29 Jul 2024
Graph-Structured Speculative Decoding
Zhuocheng Gong
Jiahao Liu
Ziyue Wang
Pengfei Wu
Jingang Wang
Xunliang Cai
Dongyan Zhao
Rui Yan
195
6
0
23 Jul 2024
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Qichen Fu
Minsik Cho
Thomas Merth
Sachin Mehta
Mohammad Rastegari
Mahyar Najibi
331
62
0
19 Jul 2024
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
Hongyu Wang
Shuming Ma
Ruiping Wang
Furu Wei
MoE
243
22
0
15 Jul 2024
Accelerating the inference of string generation-based chemical reaction models for industrial applications
Mikhail Andronov
Natalia Andronova
Michael Wand
Jürgen Schmidhuber
Djork-Arné Clevert
AI4CE
217
5
0
12 Jul 2024
Inference Optimization of Foundation Models on AI Accelerators
Youngsuk Park
Kailash Budhathoki
Liangfu Chen
Jonas M. Kübler
Jiaji Huang
Matthäus Kleindessner
Jun Huan
Volkan Cevher
Yida Wang
George Karypis
313
14
0
12 Jul 2024
Automata-based constraints for language model decoding
Terry Koo
Frederick Liu
Luheng He
AI4CE
373
40
0
11 Jul 2024
Robotic Control via Embodied Chain-of-Thought Reasoning
Michał Zawalski
William Chen
Karl Pertsch
Oier Mees
Chelsea Finn
Sergey Levine
LRM, LM&Ro
470
214
0
11 Jul 2024
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
Zilong Wang
Zifeng Wang
Long Le
Huaixiu Steven Zheng
Swaroop Mishra
...
Anush Mattapalli
Ankur Taly
Jingbo Shang
Zifeng Wang
Tomas Pfister
RALM
329
74
0
11 Jul 2024
Knowledge boosting during low-latency inference
Vidya Srinivas
Malek Itani
Tuochao Chen
Sefik Emre Eskimez
Takuya Yoshioka
Shyamnath Gollakota
286
3
0
09 Jul 2024
Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems
Amey Agrawal
Anmol Agarwal
Nitin Kedia
Jayashree Mohan
Souvik Kundu
Nipun Kwatra
Ramachandran Ramjee
Alexey Tumanov
258
9
0
09 Jul 2024
Mobile Edge Intelligence for Large Language Models: A Contemporary Survey
Guanqiao Qu
Qiyuan Chen
Wei Wei
Zheng Lin
Xianhao Chen
Kaibin Huang
544
157
0
09 Jul 2024
Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models
Jinliang Lu
Ziliang Pang
Min Xiao
Yaochen Zhu
Rui Xia
Jiajun Zhang
MoMe
395
48
0
08 Jul 2024
Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations
Bowen Shen
Zheng Lin
Daren Zha
Wei Liu
Jian Luan
Bin Wang
Weiping Wang
246
3
0
08 Jul 2024
Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models
Bolaji Yusuf
M. Baskar
Andrew Rosenberg
Bhuvana Ramabhadran
173
2
0
05 Jul 2024
Uncertainty-Guided Likelihood Tree Search
Julia Grosse
Ruotian Wu
Ahmad Rashid
Cheng Zhang
Philipp Hennig
Pascal Poupart
Agustinus Kristiadi
383
3
0
04 Jul 2024
Let the Code LLM Edit Itself When You Edit the Code
Zhenyu He
Jun Zhang
Shengjie Luo
Jingjing Xu
Zongzhang Zhang
Di He
KELM
276
3
0
03 Jul 2024
S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models
Parsa Kavehzadeh
Mohammadreza Pourreza
Mojtaba Valipour
Tinashu Zhu
Haoli Bai
Ali Ghodsi
Boxing Chen
Mehdi Rezagholizadeh
209
1
0
02 Jul 2024
Tree Search for Language Model Agents
Jing Yu Koh
Alexander Shmakov
Daniel Fried
Ruslan Salakhutdinov
LRM, LM&Ro, LLMAG
404
118
0
01 Jul 2024
Adaptive Draft-Verification for Efficient Large Language Model Decoding
Xukun Liu
Bowen Lei
Ruqi Zhang
Dongkuan Xu
271
7
0
27 Jun 2024
SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding
Zhenglin Wang
Jialong Wu
Yilong Lai
Congzhi Zhang
Deyu Zhou
LRM, ReLM
235
11
0
26 Jun 2024
Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher
Hyunjong Ok
Jegwang Ryu
Jaeho Lee
131
0
0
26 Jun 2024
Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
Yixuan Wang
Xianzhen Luo
Fuxuan Wei
Yijun Liu
Qingfu Zhu
Xuanyu Zhang
Qing Yang
Dongliang Xu
Wanxiang Che
185
5
0
25 Jun 2024
OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
Jikai Wang
Yi Su
Juntao Li
Qingrong Xia
Zi Ye
Xinyu Duan
Zhefeng Wang
Min Zhang
435
34
0
25 Jun 2024
Speeding Up Image Classifiers with Little Companions
Yang Liu
Kowshik Thopalli
Jayaraman J. Thiagarajan
VLM
269
0
0
24 Jun 2024
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Yuhui Li
Fangyun Wei
Chao Zhang
Hongyang R. Zhang
406
188
0
24 Jun 2024
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Sean Welleck
Amanda Bertsch
Matthew Finlayson
Hailey Schoelkopf
Alex Xie
Graham Neubig
Ilia Kulikov
Zaid Harchaoui
374
110
0
24 Jun 2024
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Euiin Yi
Taehyeon Kim
Hongseok Jeung
Du-Seong Chang
Se-Young Yun
178
7
0
24 Jun 2024
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph
Roman Vashurin
Ekaterina Fadeeva
Artem Vazhentsev
Akim Tsvigun
Daniil Vasilev
...
Timothy Baldwin
Preslav Nakov
Maxim Panov
Artem Shelmanov
HILM
683
61
0
21 Jun 2024
LiveMind: Low-latency Large Language Models with Simultaneous Inference
Chuangtao Chen
Grace Li Zhang
Xunzhao Yin
Cheng Zhuo
Ulf Schlichtmann
Bing Li
LRM
322
9
0
20 Jun 2024
Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
Ke Cheng
Wen Hu
Zhi Wang
Hongen Peng
Jianguo Li
Sheng Zhang
180
14
0
19 Jun 2024
Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding
Kaiyan Zhang
Jianyu Wang
Ning Ding
Biqing Qi
Ermo Hua
Xingtai Lv
Bowen Zhou
355
14
0
18 Jun 2024
CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
Yuetai Li
Zhangchen Xu
Fengqing Jiang
Luyao Niu
D. Sahabandu
Bhaskar Ramasubramanian
Radha Poovendran
SILM, AAML
515
15
0
18 Jun 2024
Promises, Outlooks and Challenges of Diffusion Language Modeling
Justin Deschenaux
Çağlar Gülçehre
DiffM
310
4
0
17 Jun 2024
On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion
Chenghao Fan
Zhenyi Lu
Wei Wei
Jie Tian
Xiaoye Qu
Dangyang Chen
Yu Cheng
MoMe
335
10
0
17 Jun 2024
Optimized Speculative Sampling for GPU Hardware Accelerators
Dominik Wagner
Seanie Lee
Ilja Baumann
Philipp Seeberger
Korbinian Riedhammer
Tobias Bocklet
216
4
0
16 Jun 2024
New Solutions on LLM Acceleration, Optimization, and Application
Yingbing Huang
Lily Jiaxin Wan
Hanchen Ye
Manvi Jha
Jinghua Wang
Yuhong Li
Xiaofan Zhang
Deming Chen
287
21
0
16 Jun 2024
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim
Karl Pertsch
Siddharth Karamcheti
Ted Xiao
Ashwin Balakrishna
...
Russ Tedrake
Dorsa Sadigh
Sergey Levine
Percy Liang
Chelsea Finn
LM&Ro, VLM
607
1,379
0
13 Jun 2024
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs
Xuan Zhang
Chao Du
Tianyu Pang
Qian Liu
Wei Gao
Min Lin
LRM, AI4CE
292
121
0
13 Jun 2024