ResearchTrend.AI
© 2026 ResearchTrend.AI, All rights reserved.

Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2023
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
    LRM
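For context, the paper's core idea is to let a cheap draft model propose several tokens and have the large target model verify them in one pass, accepting each drafted token with probability min(1, p(x)/q(x)) and resampling from the normalized residual max(0, p − q) on rejection, which provably preserves the target distribution. The following is a minimal illustrative sketch of that accept/reject loop; the fixed probability tables stand in for real draft/target language models, and all names (`draft_dist`, `target_dist`, `speculative_step`) are invented for this example.

```python
import random

# Toy stand-in "models" over a 3-token vocabulary. In real speculative
# decoding these would be a small draft LM and a large target LM.
VOCAB = ["a", "b", "c"]

def draft_dist(prefix):
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def target_dist(prefix):
    return {"a": 0.5, "b": 0.4, "c": 0.1}

def sample(dist):
    return random.choices(list(dist), weights=dist.values(), k=1)[0]

def speculative_step(prefix, gamma=4):
    """One speculative-decoding step: draft gamma tokens, then accept or
    reject them so the output distribution matches the target model."""
    drafted = []
    for _ in range(gamma):
        drafted.append(sample(draft_dist(prefix + drafted)))
    accepted = []
    for x in drafted:
        p = target_dist(prefix + accepted)  # target probabilities
        q = draft_dist(prefix + accepted)   # draft probabilities
        if random.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)              # token accepted as-is
        else:
            # Rejected: resample from the residual max(0, p - q), normalized.
            residual = {t: max(0.0, p[t] - q[t]) for t in VOCAB}
            z = sum(residual.values())
            accepted.append(sample({t: w / z for t, w in residual.items()}))
            return accepted                 # stop at the first rejection
    # All drafts accepted: the verification pass yields one extra token.
    accepted.append(sample(target_dist(prefix + accepted)))
    return accepted

print(speculative_step([]))
```

One step thus emits between 1 and gamma + 1 tokens for a single target-model verification pass, which is where the speedup comes from.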

Papers citing "Fast Inference from Transformers via Speculative Decoding"

Showing 50 of 763 citing papers
TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhou Chen
Zhiqiang Wei
Yuqi Bai
Xue Xiong
Jianmin Wu
3DV
181
6
0
14 Jun 2025
Semantic Scheduling for LLM Inference
Qingfeng Lan
Dujian Ding
Yile Gu
Yujie Ren
Kai Mei
Minghua Ma
William Yang Wang
152
0
0
13 Jun 2025
Cascaded Language Models for Cost-effective Human-AI Decision-Making
Claudio Fanconi
M. Schaar
345
1
0
13 Jun 2025
Efficient LLM Collaboration via Planning
Byeongchan Lee
Jonghoon Lee
Dongyoung Kim
Jaehyung Kim
Kyungjoon Park
Dongjun Lee
Jinwoo Shin
LRM
242
3
0
13 Jun 2025
SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding
Ziyi Zhang
Ziheng Jiang
Chengquan Jiang
Menghan Yu
Size Zheng
H. Lin
Henry Hoffmann
Xin Liu
230
3
0
12 Jun 2025
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Yang Li
Qiang Sheng
Yehan Yang
Xueyao Zhang
Juan Cao
339
7
0
11 Jun 2025
SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving
Xiangchen Li
Dimitrios Spatharakis
Saeid Ghafouri
Jiakun Fan
Dimitrios Nikolopoulos
Deepu John
Bo Ji
Dimitrios S. Nikolopoulos
411
6
0
11 Jun 2025
On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention
Yeonju Ro
Zhenyu Zhang
Souvik Kundu
Zhangyang Wang
Aditya Akella
420
2
0
11 Jun 2025
Brevity is the soul of sustainability: Characterizing LLM response lengths
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
S. Poddar
Paramita Koley
Janardan Misra
Sanjay Podder
Navveen Balani
Niloy Ganguly
Saptarshi Ghosh
244
5
0
10 Jun 2025
ADAM: Autonomous Discovery and Annotation Model using LLMs for Context-Aware Annotations
Amirreza Rouhi
Solmaz Arezoomandan
Knut Peterson
Joseph T. Woods
David Han
VLM
199
11
0
10 Jun 2025
SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
Yizhao Gao
Shuming Guo
Shijie Cao
Yuqing Xia
Yu Cheng
...
Hayden Kwok-Hay So
Yu Hua
Ting Cao
Fan Yang
Mao Yang
VLM LRM
218
9
0
10 Jun 2025
LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
Jin Huang
Yuchao Jin
Le An
Josh Park
VLM
185
3
0
09 Jun 2025
MiniCPM4: Ultra-Efficient LLMs on End Devices
MiniCPM Team
Chaojun Xiao
Yuxuan Li
Xu Han
Yuzhuo Bai
...
Zhiyuan Liu
Guoyang Zeng
Chao Jia
Dahai Li
Maosong Sun
MLLM
315
21
0
09 Jun 2025
Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Feifan Song
Shaohang Wei
Wen Luo
Yuxuan Fan
Tianyu Liu
Guoyin Wang
Houfeng Wang
216
4
0
09 Jun 2025
Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse
Zhekai Duan
Yuan Zhang
Shikai Geng
Gaowen Liu
Joschka Boedecker
Chris Xiaoxuan Lu
LRM
277
11
0
09 Jun 2025
Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit
Charles Goddard
Fernando Fernandes Neto
166
2
0
07 Jun 2025
Spark Transformer: Reactivating Sparsity in FFN and Attention
Chong You
Kan Wu
Zhipeng Jia
Lin Chen
Srinadh Bhojanapalli
...
Felix X. Yu
Prateek Jain
David Culler
Henry M. Levy
Sanjiv Kumar
231
2
0
07 Jun 2025
Inference economics of language models
Ege Erdil
230
7
0
05 Jun 2025
Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan
Zhuoming Chen
Haizhong Zheng
Yang Zhou
Emma Strubell
Beidi Chen
458
6
0
05 Jun 2025
List-Level Distribution Coupling with Applications to Speculative Decoding and Lossy Compression
Joseph Rowan
Buu Phan
Ashish Khisti
304
0
0
05 Jun 2025
Accelerated Test-Time Scaling with Model-Free Speculative Sampling
Woomin Song
Saket Dingliwal
Sai Muralidhar Jayanthi
Bhavana Ganesh
Jinwoo Shin
Aram Galstyan
S. Bodapati
LRM
356
2
0
05 Jun 2025
AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism
Zhepei Wei
Wei-Lin Chen
Xinyu Zhu
Yu Meng
OffRL
311
3
0
04 Jun 2025
Rectified Sparse Attention
Yutao Sun
Tianzhu Ye
Li Dong
Yuqing Xia
Jian Chen
Yizhao Gao
S. Cao
Jianyong Wang
Furu Wei
300
5
0
04 Jun 2025
QA-HFL: Quality-Aware Hierarchical Federated Learning for Resource-Constrained Mobile Devices with Heterogeneous Image Quality
Sajid Hussain
Muhammad Sohail
Nauman Ali Khan
382
4
0
04 Jun 2025
The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective
Jiin Kim
Byeongjun Shin
Jinha Chung
Minsoo Rhu
LLMAG LRM
346
12
0
04 Jun 2025
Guided Speculative Inference for Efficient Test-Time Alignment of LLMs
Jonathan Geuter
Youssef Mroueh
David Alvarez-Melis
389
1
0
04 Jun 2025
Learning to Insert [PAUSE] Tokens for Better Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Eunki Kim
Sangryul Kim
James Thorne
LRM
318
3
0
04 Jun 2025
Out-of-Vocabulary Sampling Boosts Speculative Decoding
Nadav Timor
Jonathan Mamou
Oren Pereg
Hongyang Zhang
David Harel
OODD
138
1
0
02 Jun 2025
Mamba Drafters for Speculative Decoding
Daewon Choi
Seunghyuk Oh
Saket Dingliwal
Jihoon Tack
Kyuyoung Kim
...
Insu Han
Jinwoo Shin
Aram Galstyan
Shubham Katiyar
S. Bodapati
290
0
0
01 Jun 2025
CLaSp: In-Context Layer Skip for Self-Speculative Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Longze Chen
Renke Shan
Huiming Wang
Lu Wang
Ziqiang Liu
Run Luo
Jiawei Wang
Hamid Alinejad-Rokny
Min Yang
154
2
0
30 May 2025
RAST: Reasoning Activation in LLMs via Small-model Transfer
Siru Ouyang
Xinyu Zhu
Zilin Xiao
Minhao Jiang
Yu Meng
Jiawei Han
OffRL ReLM LRM
256
1
0
30 May 2025
FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
Aniruddha Nrusimha
William Brandon
Mayank Mishra
Yikang Shen
Rameswar Panda
Jonathan Ragan-Kelley
Yoon Kim
VLM
216
1
0
28 May 2025
Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Yudi Zhang
Weilin Zhao
Xu Han
Tiejun Zhao
Wang Xu
Hailong Cao
Conghui Zhu
MQ
370
1
0
28 May 2025
RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding
Yuichiro Hoshino
Hideyuki Tachibana
Muneyoshi Inahara
Hiroto Takegawa
288
1
0
28 May 2025
Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits
Yeshwanth Venkatesha
Souvik Kundu
Priyadarshini Panda
166
6
0
27 May 2025
Efficient Large Language Model Inference with Neural Block Linearization
Mete Erdogan
F. Tonin
Volkan Cevher
365
1
0
27 May 2025
SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
Jungyoub Cha
Hyunjong Kim
Sungzoon Cho
VLM
334
0
0
27 May 2025
Faster and Better LLMs via Latency-Aware Test-Time Scaling
Zili Wang
Tianyu Zhang
Haoli Bai
Lu Hou
Xianzhi Yu
Wulong Liu
Shiming Xiang
Lei Zhu
LRM
377
7
0
26 May 2025
Do Large Language Models (Really) Need Statistical Foundations?
Weijie Su
625
4
0
25 May 2025
Plug-and-Play Context Feature Reuse for Efficient Masked Generation
Xuejie Liu
Anji Liu
Karen Ullrich
Yitao Liang
266
3
0
25 May 2025
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xuan Zhang
Cunxiao Du
Sicheng Yu
Jiawei Wu
Fengzhuo Zhang
Wei Gao
Qian Liu
234
1
0
25 May 2025
System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts
Xiaoqiang Wang
Suyuchen Wang
Yun Zhu
Bang Liu
ReLM LRM
401
6
0
25 May 2025
Inference Compute-Optimal Video Vision Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Peiqi Wang
ShengYun Peng
Xuewen Zhang
Hanchao Yu
Yibo Yang
Lifu Huang
Fujun Liu
Qifan Wang
VLM
277
2
0
24 May 2025
Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding
Yixuan Wang
Yijun Liu
Shiyu Ji
Yuzhuang Xu
Yang Xu
Qingfu Zhu
Wanxiang Che
OffRL LRM
300
1
0
24 May 2025
VeriThinker: Learning to Verify Makes Reasoning Model Efficient
Zigeng Chen
Xinyin Ma
Gongfan Fang
Ruonan Yu
Xinchao Wang
LRM
328
15
0
23 May 2025
L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models
Xiaohao Liu
Xiaobo Xia
Weixiang Zhao
Manyi Zhang
Xianzhi Yu
Xiu Su
Shuo Yang
See-Kiong Ng
Tat-Seng Chua
KELM LRM
411
3
0
23 May 2025
Thought calibration: Efficient and confident test-time scaling
Menghua Wu
Cai Zhou
Stephen Bates
Tommi Jaakkola
LRM
294
3
0
23 May 2025
KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization
Mingbo Song
Heming Xia
Jun Zhang
Chak Tou Leong
Qiancheng Xu
Wenjie Li
Sujian Li
191
1
0
22 May 2025
LLM-Based Emulation of the Radio Resource Control Layer: Towards AI-Native RAN Protocols
Ziming Liu
Bryan Liu
Alvaro Valcarce
Xiaoli Chu
365
3
0
22 May 2025
Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding
Zijian Lin
Yang Zhang
Yougen Yuan
Yuming Yan
Jinjiang Liu
Zhiyong Wu
Pengfei Hu
Qun Yu
319
1
0
21 May 2025
Page 5 of 16