Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2023
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
    LRM
arXiv: 2211.17192 (abs · PDF · HTML) · HuggingFace (9 upvotes)
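
The cited paper introduces speculative decoding: a small draft model proposes several tokens, the large target model scores them all in a single parallel pass, each draft token is accepted with probability min(1, p(x)/q(x)), and on the first rejection a replacement token is sampled from the normalized residual (p - q)^+. The sketch below illustrates one such step under simplifying assumptions; `draft_model`, `target_model`, and the draft length `gamma` are illustrative placeholders, not the authors' code or any library API.

```python
import torch

def speculative_decode_step(target_model, draft_model, prefix, gamma=4):
    """One speculative decoding step (sketch only).

    Both models are assumed to map a 1-D tensor of token ids to a
    [seq_len, vocab_size] tensor of next-token probabilities; the names
    are illustrative placeholders, not a real API.
    """
    # 1. Draft `gamma` candidate tokens autoregressively with the cheap model.
    ids, draft_tokens, draft_dists = prefix, [], []
    for _ in range(gamma):
        q = draft_model(ids)[-1]                  # q(x | prefix, drafts so far)
        x = torch.multinomial(q, 1)
        draft_tokens.append(x.item())
        draft_dists.append(q)
        ids = torch.cat([ids, x])

    # 2. Score the prefix plus all drafts with the target model in one pass.
    p_all = target_model(ids)                     # p(x | ...) at every position

    # 3. Accept each draft token with probability min(1, p(x) / q(x)).
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_all[len(prefix) - 1 + i], draft_dists[i]
        if torch.rand(()) < min(1.0, (p[x] / q[x]).item()):
            out.append(x)                         # token accepted as-is
        else:
            # First rejection: resample from the normalized residual (p - q)^+
            # and stop; later drafts are discarded.
            residual = torch.clamp(p - q, min=0.0)
            out.append(torch.multinomial(residual / residual.sum(), 1).item())
            return out
    # All gamma drafts accepted: emit one extra token from the target model.
    out.append(torch.multinomial(p_all[-1], 1).item())
    return out
```

Per the paper's analysis, this acceptance/residual-resampling rule leaves the output distribution identical to sampling from the target model alone, while amortizing the target model's cost over several tokens per pass.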

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown
Towards On-Device Personalization: Cloud-device Collaborative Data Augmentation for Efficient On-device Language Model
Zhaofeng Zhong
Wei Yuan
Liang Qu
Tong Chen
Hao Wang
Xiangyu Zhao
Hongzhi Yin
136
1
0
29 Aug 2025
Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution
Chen Chen
Yuchen Sun
Jiaxin Gao
Xueluan Gong
Qian-Wei Wang
Ziyao Wang
Yongsen Zheng
K. Lam
AAML, KELM
160
0
0
28 Aug 2025
History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL
Jingkai He
Tianjian Li
Erhu Feng
Dong Du
Qian Liu
Tao Liu
Yubin Xia
Haibo Chen
149
16
0
26 Aug 2025
Speculative Safety-Aware Decoding
Xuekang Wang
Shengyu Zhu
Xueqi Cheng
174
0
0
25 Aug 2025
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji
Jun Zhang
Heming Xia
Jinpeng Chen
Lidan Shou
Gang Chen
Huan Li
VLM
241
3
0
22 Aug 2025
Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates
Wenshu Fan
Yi-Ling Chen
Yongwei Zhao
Y. Hao
Zifu Zheng
...
Zidong Du
Zhiwei Xu
Qi Guo
Tianshi Chen
Yunji Chen
120
0
0
22 Aug 2025
GPT-OSS-20B: A Comprehensive Deployment-Centric Analysis of OpenAI's Open-Weight Mixture of Experts Model
Deepak Kumar
Divakar Yadav
Yash Patel
MoE
193
3
0
22 Aug 2025
Confidence-Modulated Speculative Decoding for Large Language Models
Jaydip Sen
Subhasis Dasgupta
Hetvi Waghela
UQLM
295
1
0
21 Aug 2025
WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling
Jiacheng Li
Jianchao Tan
Zhidong Yang
Pingwei Sun
Feiye Huo
...
Xiangyu Zhang
Maoxin He
Guangming Tan
Weile Jia
Tong Zhao
113
3
0
21 Aug 2025
Measuring the environmental impact of delivering AI at Google Scale
Cooper Elsworth
Keguo Huang
David Patterson
Ian Schneider
Robert Sedivy
...
Parthasarathy Ranganathan
J. Dean
Amin Vahdat
Ben Gomes
James Manyika
112
14
0
21 Aug 2025
Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner
Bolian Li
Yanran Wu
Xinyu Luo
Ruqi Zhang
246
2
0
20 Aug 2025
A Comparative Study of Decoding Strategies in Medical Text Generation
Oriana Presacan
Alireza Nik
Vajira Thambawita
Bogdan Ionescu
Michael A. Riegler
LM&MA
125
0
0
19 Aug 2025
Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding
Jihoon Park
Seungeun Oh
Seong-Lyun Kim
98
1
0
18 Aug 2025
Cost-Aware Contrastive Routing for LLMs
Reza Shirkavand
Shangqian Gao
Qi He
Heng-Chiao Huang
313
1
0
17 Aug 2025
Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks
Rui Bao
Nan Xue
Yaping Sun
Zhiyong Chen
75
1
0
15 Aug 2025
Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech Translation
Chenyang Le
Yinfeng Xia
Huiyan Li
Manhong Wang
Yutao Sun
Xingyang Ma
Yanmin Qian
88
0
0
15 Aug 2025
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
NextStep Team
Chunrui Han
Guopeng Li
J. Wu
Quan Sun
...
Ziyang Meng
Binxing Jiao
Daxin Jiang
X. Zhang
Yibo Zhu
DiffM
202
22
0
14 Aug 2025
READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Maxim Divilkovskiy
Vitaly Malygin
Sergey Zlobin
Sultan Isali
Vasily Kalugin
Stanislav Ilyushin
Nuriza Aitassova
Yi Fei
Zeng Weidi
RALM
163
0
0
12 Aug 2025
ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs
Keyu Chen
Zhifeng Shen
Daohai Yu
Haoqian Wu
Wei Wen
Jianfeng He
Ruizhi Qiao
Xing Sun
127
4
0
12 Aug 2025
Grouped Speculative Decoding for Autoregressive Image Generation
Junhyuk So
Juncheol Shin
Hyunho Kook
Eunhyeok Park
DiffM
100
3
0
11 Aug 2025
OverFill: Two-Stage Models for Efficient Language Model Decoding
Woojeong Kim
Junxiong Wang
Jing Nathan Yan
Mohamed S. Abdelfattah
Alexander M Rush
108
0
0
11 Aug 2025
Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
Bangsheng Tang
Carl Chengyan Fu
Fei Kou
Grigory Sizov
Haoci Zhang
...
Vlad Mihailescu
Xingwen Guo
Yan Cui
Y. Hu
Yejin Lee
LRM
256
4
0
11 Aug 2025
Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation
Xutong Liu
Baran Atalar
Xiangxiang Dai
Jinhang Zuo
Siwei Wang
John C. S. Lui
Wei Chen
Carlee Joe-Wong
OffRL
188
0
0
11 Aug 2025
CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference
Enyu Zhou
Kai Sheng
Hao Chen
Xin He
LRM
174
0
0
06 Aug 2025
An Efficient and Adaptive Next Edit Suggestion Framework with Zero Human Instructions in IDEs
Xinfang Chen
Siyang Xiao
Xianying Zhu
Junhong Xie
Ming Liang
Dajun Chen
Wei Jiang
Yong Li
Peng Di
121
2
0
04 Aug 2025
SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference
Yi Zhao
Yajuan Peng
Cam-Tu Nguyen
Zuchao Li
Xiaoliang Wang
Hai Zhao
Xiaoming Fu
209
2
0
03 Aug 2025
Fast and scalable retrosynthetic planning with a transformer neural network and speculative beam search
Mikhail Andronov
Natalia Andronova
Michael Wand
J. Schmidhuber
Djork-Arné Clevert
91
2
0
02 Aug 2025
Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 2025
Agrim Bari
Parikshit Hegde
G. Veciana
165
1
0
01 Aug 2025
XSpecMesh: Quality-Preserving Auto-Regressive Mesh Generation Acceleration via Multi-Head Speculative Decoding
Dian Chen
Yansong Qu
Xinyang Li
Ming Li
Shengchuan Zhang
221
2
0
31 Jul 2025
Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance
Songsheng Wang
Rucheng Yu
Zhihang Yuan
Chao Yu
Feng Gao
Yu-Ping Wang
Derek F. Wong
187
7
0
30 Jul 2025
Hierarchical Verification of Speculative Beams for Accelerating LLM Inference
Jaydip Sen
Harshitha Puvvala
Subhasis Dasgupta
165
2
0
30 Jul 2025
Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting
Tuan Vu Ho
Hiroaki Kokubo
Masaaki Yamamoto
Yohei Kawaguchi
102
0
0
29 Jul 2025
SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
Design Automation Conference (DAC), 2025
Linye Wei
Shuzhang Zhong
Songqiang Xu
Runsheng Wang
Ru Huang
Meng Li
247
0
0
24 Jul 2025
GATEBLEED: Exploiting On-Core Accelerator Power Gating for High Performance & Stealthy Attacks on AI
Joshua Kalyanapu
Farshad Dizani
Darsh Asher
Azam Ghanbari
Rosario Cammarota
Aydin Aysu
Samira Mirbagher Ajorpaz
282
0
0
22 Jul 2025
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models
Cheng-Han Chiang
Xiaofei Wang
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
S. Liu
Zhendong Wang
Zhengyuan Yang
Hung-yi Lee
Lijuan Wang
ReLM, LRM
141
10
0
21 Jul 2025
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
Zhengyue Zhao
Yingzi Ma
S. Jha
Marco Pavone
P. McDaniel
Chaowei Xiao
LRM
203
2
0
14 Jul 2025
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Sangmin Bae
Yujin Kim
Reza Bayat
S. Kim
Jiyoun Ha
...
Adam Fisch
Hrayr Harutyunyan
Ziwei Ji
Aaron Courville
Se-Young Yun
MoE
296
25
0
14 Jul 2025
TPP-SD: Accelerating Transformer Point Process Sampling with Speculative Decoding
Shukai Gong
Yiyang Fu
Fengyuan Ran
Quyu Kong
Feng Zhou
177
0
0
12 Jul 2025
BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
Chenyang Song
Weilin Zhao
Xu Han
Chaojun Xiao
Yingfa Chen
Yuxuan Li
Zhiyuan Liu
Maosong Sun
MoE
260
0
0
11 Jul 2025
OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
R. Ramakrishnan
Zhaocong Yuan
Shaojie Zhuo
Chen Feng
Yicheng Lin
Chenzheng Su
Xiaopeng Zhang
SyDa
342
1
0
03 Jul 2025
Cautious Next Token Prediction
Yizhou Wang
Lingzhi Zhang
Yue Bai
M. Chiu
Zhengmian Hu
M. Zhang
Qihua Dong
Yu Yin
Sohrab Amirghodsi
Y. Fu
225
2
0
03 Jul 2025
VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs
Raghavv Goel
Sudhanshu Agrawal
Mukul Gagrani
Junyoung Park
Yifan Zao
...
Y. Yang
Xin Yuan
Jiuyan Lu
Chris Lott
Mingu Lee
116
4
0
28 Jun 2025
Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models
Omer Luxembourg
Haim Permuter
Eliya Nachmani
DiffM
221
14
0
23 Jun 2025
PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries
Steven Kolawole
Keshav Santhanam
Virginia Smith
Pratiksha Thaker
LRM
140
1
0
23 Jun 2025
Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?
Adithya Bhaskar
Alexander Wettig
Tianyu Gao
Yihe Dong
Danqi Chen
169
4
0
20 Jun 2025
PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction
Shufan Li
Aditya Grover
245
3
0
18 Jun 2025
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
Zijian Zhou
Ao Qu
Zhaoxuan Wu
Sunghwan Kim
Alok Prakash
Daniela Rus
Jinhua Zhao
Bryan Kian Hsiang Low
Paul Liang
LLMAG, OffRL, LRM
391
50
0
18 Jun 2025
S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models
Tao He
Guang Huang
Yu Yang
Tianshi Xu
Sicheng Zhao
Guiguang Ding
Pengyang Wang
Feng Tian
LRM
198
0
0
17 Jun 2025
Multimodal Large Language Models-Enabled UAV Swarm: Towards Efficient and Intelligent Autonomous Aerial Systems
Yuqi Ping
Tianhao Liang
Yunpeng Song
Guangyu Lei
Junwei Wu
...
Rui Shao
Chiya Zhang
Weizheng Zhang
Weijie Yuan
Tingting Zhang
173
10
0
15 Jun 2025
$\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri
Nived Rajaraman
Rishabh Tiwari
Xiaoxuan Liu
Kurt Keutzer
Ion Stoica
Kannan Ramchandran
Ahmad Beirami
Ziteng Sun
LRM
213
2
0
15 Jun 2025