ResearchTrend.AI

Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192)

International Conference on Machine Learning (ICML), 2022
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
    LRM
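The cited paper's core idea — drafting several tokens with a cheap model and verifying them in one pass with the target model — can be sketched as follows. This is a minimal toy, not the paper's implementation: `draft_dist` and `target_dist` are made-up three-token distributions standing in for real language models.

```python
import random

random.seed(0)
VOCAB = ["a", "b", "c"]

def draft_dist(prefix):   # cheap approximation model q(x | prefix), hypothetical
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def target_dist(prefix):  # expensive target model p(x | prefix), hypothetical
    return {"a": 0.5, "b": 0.4, "c": 0.1}

def sample(dist):
    """Draw one token from a {token: prob} distribution."""
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # guard against floating-point shortfall

def speculative_step(prefix, gamma=4):
    """One draft-then-verify round; returns the tokens accepted this round."""
    # 1) Autoregressively draft gamma tokens with the cheap model.
    drafts, ctx = [], list(prefix)
    for _ in range(gamma):
        q = draft_dist(ctx)
        x = sample(q)
        drafts.append((x, q))
        ctx.append(x)
    # 2) Verify each draft token: accept with probability min(1, p(x)/q(x)).
    accepted = []
    for x, q in drafts:
        p = target_dist(prefix + accepted)
        if random.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)          # draft token kept
        else:
            # On rejection, resample from the residual max(0, p - q), renormalized,
            # which preserves the target distribution exactly.
            residual = {t: max(0.0, p[t] - q[t]) for t in VOCAB}
            z = sum(residual.values())
            residual = {t: v / z for t, v in residual.items()}
            accepted.append(sample(residual))
            return accepted             # stop at the first rejection
    # 3) All gamma drafts accepted: sample one bonus token from the target.
    accepted.append(sample(target_dist(prefix + accepted)))
    return accepted

print(speculative_step([]))
```

Each round costs one target-model evaluation pass but can emit up to `gamma + 1` tokens, which is where the speedup comes from when the draft model approximates the target well.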

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Gleb Rodionov
Roman Garipov
Alina Shutova
George Yakushev
Erik Schultheis
Vage Egiazarian
Anton Sinitsin
Denis Kuznedelev
Dan Alistarh
LRM
08 Apr 2025
SPIRe: Boosting LLM Inference Throughput with Speculative Decoding
Sanjit Neelam
Daniel Heinlein
Vaclav Cvicek
Akshay Mishra
Reiner Pope
LRM
08 Apr 2025
DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Hossein Entezari Zarch
Lei Gao
Chaoyi Jiang
Murali Annavaram
LRM
08 Apr 2025
SpecPipe: Accelerating Pipeline Parallelism-based LLM Inference with Speculative Decoding
Haofei Yin
Mengbai Xiao
Rouzhou Lu
Xiao Zhang
Dongxiao Yu
Guanghui Zhang
AI4CE
05 Apr 2025
SLOs-Serve: Optimized Serving of Multi-SLO LLMs
Siyuan Chen
Zhipeng Jia
S. Khan
Arvind Krishnamurthy
Phillip B. Gibbons
05 Apr 2025
Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding
Sakhinana Sagar Srinivas
Akash Das
Shivam Gupta
Venkataramana Runkana
OffRL
02 Apr 2025
Efficient Construction of Model Family through Progressive Training Using Model Expansion
Kazuki Yano
Sho Takase
Sosuke Kobayashi
Shun Kiyono
Jun Suzuki
01 Apr 2025
Collaborative LLM Numerical Reasoning with Local Data Protection
Min Zhang
Yuzhe Lu
Yun Zhou
Panpan Xu
Lin Lee Cheong
Chang-Tien Lu
Haozhu Wang
01 Apr 2025
Adaptive Layer-skipping in Pre-trained LLMs
Xuan Luo
Weizhi Wang
Xifeng Yan
31 Mar 2025
Model Hemorrhage and the Robustness Limits of Large Language Models
Ziyang Ma
Hui Yuan
Guang Dai
Gui-Song Xia
Bo Du
Liangpei Zhang
Dacheng Tao
31 Mar 2025
Speculative End-Turn Detector for Efficient Speech Chatbot Assistant
Hyunjong Ok
Suho Yoo
Jaeho Lee
30 Mar 2025
FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning
Hang Guo
Yawei Li
Taolin Zhang
Jiadong Wang
Tao Dai
Shu-Tao Xia
Luca Benini
30 Mar 2025
Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
Aayush Gautam
Susav Shrestha
Narasimha Annapareddy
28 Mar 2025
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Computer Vision and Pattern Recognition (CVPR), 2025
Qingqing Zhao
Yao Lu
Moo Jin Kim
Zipeng Fu
Zhuoyang Zhang
...
Ankur Handa
Xuan Li
Donglai Xiang
Gordon Wetzstein
Nayeon Lee
LM&Ro, LRM
27 Mar 2025
Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
Yijiong Yu
LRM, AIMat
26 Mar 2025
PCM : Picard Consistency Model for Fast Parallel Sampling of Diffusion Models
Computer Vision and Pattern Recognition (CVPR), 2025
Junhyuk So
Jiwoong Shin
Chaeyeon Jang
Eunhyeok Park
DiffM
25 Mar 2025
BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
Dayou Du
Shijie Cao
Jianyi Cheng
Ting Cao
M. Yang
Mao Yang
MQ
24 Mar 2025
A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models
Zuan Xie
Yang Xu
Hongli Xu
Yunming Liao
Zhiwei Yao
23 Mar 2025
A Multi-Model Adaptation of Speculative Decoding for Classification
Somnath Roy
Padharthi Sreekar
Srivatsa Narasimha
Anubhav Anand
23 Mar 2025
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs
Shibo Jie
Yehui Tang
Kai Han
Zhi-Hong Deng
Jing Han
20 Mar 2025
Prada: Black-Box LLM Adaptation with Private Data on Resource-Constrained Devices
Liang Luo
Bowei Tian
Sihan Chen
Yu Li
Zheyu Shen
Myungjin Lee
Ang Li
19 Mar 2025
Growing a Twig to Accelerate Large Vision-Language Models
Zhenwei Shao
Mingyang Wang
Zhou Yu
Wenwen Pan
Yan Yang
Tao Wei
Hao Zhang
Ning Mao
Wei Chen
Jun Yu
VLM
18 Mar 2025
Decision Tree Induction Through LLMs via Semantically-Aware Evolution
International Conference on Learning Representations (ICLR), 2025
Tennison Liu
Nicolas Huynh
M. Schaar
18 Mar 2025
Speculative Decoding for Verilog: Speed and Quality, All in One
Design Automation Conference (DAC), 2025
Changran Xu
Yi Liu
Yunhao Zhou
Shan Huang
Ningyi Xu
Qiang Xu
18 Mar 2025
ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
E. Georganas
Dhiraj D. Kalamkar
Alexander Kozlov
A. Heinecke
MQ
17 Mar 2025
G-Boost: Boosting Private SLMs with General LLMs
Yijiang Fan
Yuren Mao
Longbin Lai
Ying Zhang
Zhengping Qian
Yunjun Gao
13 Mar 2025
Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
Jiajun Li
Yixing Xu
Haiduo Huang
Xuanwu Yin
D. Li
Edith C. -H. Ngai
E. Barsoum
13 Mar 2025
Collaborative Speculative Inference for Efficient LLM Inference Serving
Luyao Gao
Jianchun Liu
Hongli Xu
Xichong Zhang
Yunming Liao
Liusheng Huang
13 Mar 2025
Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Arvid Frydenlund
LRM
13 Mar 2025
Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Siqi Fan
Xuezhi Fang
Xingrun Xing
Peng Han
Shuo Shang
Yequan Wang
11 Mar 2025
Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
IEEE International Conference on Cloud Computing (CLOUD), 2025
Pol G. Recasens
Ferran Agullo
Yue Zhu
Chen Wang
Eun Kyung Lee
Olivier Tardieu
Jordi Torres
Josep Ll. Berral
11 Mar 2025
Queueing, Predictions, and LLMs: Challenges and Open Problems
Michael Mitzenmacher
Rana Shahout
AI4TS, LRM
10 Mar 2025
Training Domain Draft Models for Speculative Decoding: Best Practices and Insights
Fenglu Hong
Ravi Raju
Jonathan Li
Bo Li
Urmish Thakker
Avinash Ravichandran
Swayambhoo Jain
Changran Hu
10 Mar 2025
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
Jongwoo Ko
Tianyi Chen
Sungnyun Kim
Tianyu Ding
Luming Liang
Ilya Zharkov
Se-Young Yun
VLM
10 Mar 2025
Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yingfeng Luo
Tong Zheng
Yongyu Mu
Yangqiu Song
Qinghong Zhang
...
Ziqiang Xu
Peinan Feng
Xiaoqian Liu
Tong Xiao
Jingbo Zhu
AI4CE
09 Mar 2025
Exploiting Edited Large Language Models as General Scientific Optimizers
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Qitan Lv
T. Liu
Haoyu Wang
08 Mar 2025
AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving
Kaiyu Huang
Yu Wang
Zhubo Shi
Han Zou
Minchen Yu
Qingjiang Shi
LRM
07 Mar 2025
Speculative Decoding for Multi-Sample Inference
Yiwei Li
Jiayi Shi
Shaoxiong Feng
Peiwen Yuan
Xinyu Wang
...
Ji Zhang
Chuyi Tan
Boyuan Pan
Yao Hu
Kan Li
LRM
07 Mar 2025
DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models
Ruizhe Chen
Wenhao Chai
Zhifei Yang
Xiaotian Zhang
Qiufeng Wang
Tony Q.S. Quek
Soujanya Poria
Zuozhu Liu
06 Mar 2025
Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models
Benyamin Jamialahmadi
Parsa Kavehzadeh
Mehdi Rezagholizadeh
Parsa Farinneya
Hossein Rajabzadeh
A. Jafari
Boxing Chen
Marzieh S. Tahaei
06 Mar 2025
RASD: Retrieval-Augmented Speculative Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Guofeng Quan
Wenfeng Feng
Chuzhan Hao
Guochao Jiang
Yuewei Zhang
Hao Wang
RALM
05 Mar 2025
FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference
Hongchao Du
Shangyu Wu
Arina Kharlamova
Nan Guan
Chun Jason Xue
04 Mar 2025
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Yuhui Li
Fangyun Wei
Chao Zhang
Hongyang R. Zhang
03 Mar 2025
Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding
Yun Wang
Pei Zhang
Siyuan Huang
Baosong Yang
Zizhuo Zhang
Fei Huang
Rui Wang
BDL, LRM
03 Mar 2025
DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
Kai Lv
Honglin Guo
Qipeng Guo
Xipeng Qiu
02 Mar 2025
Tutorial Proposal: Speculative Decoding for Efficient LLM Inference
Heming Xia
Cunxiao Du
Yongqian Li
Qian Liu
Wenjie Li
01 Mar 2025
Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Maximilian Holsman
Yukun Huang
Bhuwan Dhingra
28 Feb 2025
RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
Guanzheng Chen
Qilong Feng
Jinjie Ni
Xin Li
Michael Shieh
RALM
27 Feb 2025
Speculative Decoding and Beyond: An In-Depth Survey of Techniques
Y. Hu
Zining Liu
Zhenyuan Dong
Tianfan Peng
Bradley McDanel
Shanghang Zhang
27 Feb 2025
Gatekeeper: Improving Model Cascades Through Confidence Tuning
Stephan Rabanser
Nathalie Rauschmayr
Achin Kulshrestha
Petra Poklukar
Wittawat Jitkrittum
Sean Augenstein
Congchao Wang
Federico Tombari
26 Feb 2025