2211.17192
Fast Inference from Transformers via Speculative Decoding
International Conference on Machine Learning (ICML), 2022
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
LRM
Papers citing
"Fast Inference from Transformers via Speculative Decoding"
50 / 763 papers shown
TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation
Tong Wu
Junzhe Shen
Zixia Jia
Yanjie Wang
Zilong Zheng
310
1
0
26 Feb 2025
Towards Optimal Multi-draft Speculative Decoding
International Conference on Learning Representations (ICLR), 2025
Zhibo Hu
Tong Zheng
Vignesh Viswanathan
Ziyi Chen
Ryan Rossi
Yihan Wu
Dinesh Manocha
Heng Huang
290
11
0
26 Feb 2025
CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yepeng Weng
Dianwen Mei
Huishi Qiu
Xujie Chen
Li Liu
Jiang Tian
Peng Wang
652
3
0
24 Feb 2025
FastCoder: Accelerating Repository-level Code Generation via Efficient Retrieval and Verification
Qianhui Zhao
Lingling Zhang
Fang Liu
Xiaoli Lian
Qiaoyuanhe Meng
Ziqian Jiao
Zetong Zhou
Borui Zhang
Runlin Guo
329
0
0
24 Feb 2025
Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
Tian Jin
Ellie Y. Cheng
Zack Ankner
Nikunj Saunshi
Blake M. Elias
Amir Yazdanbakhsh
Jonathan Ragan-Kelley
Suvinay Subramanian
Michael Carbin
356
18
0
24 Feb 2025
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
Penghui Yang
Cunxiao Du
Fengzhuo Zhang
Haonan Wang
Tianyu Pang
Chao Du
Bo An
RALM
315
2
0
24 Feb 2025
Dynamic Parallel Tree Search for Efficient LLM Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yifu Ding
Wentao Jiang
Shunyu Liu
Yongcheng Jing
Jinpei Guo
...
Zengmao Wang
Ziqiang Liu
Di Lin
Xianglong Liu
Dacheng Tao
LRM
488
30
0
22 Feb 2025
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025
Yintao He
Haiyu Mao
Christina Giannoula
Mohammad Sadrosadati
Juan Gómez Luna
Huawei Li
Xiaowei Li
Ying Wang
O. Mutlu
385
21
0
21 Feb 2025
TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhaoxuan Wu
Zijian Zhou
Arun Verma
Alok Prakash
Daniela Rus
Bryan Kian Hsiang Low
348
3
0
21 Feb 2025
DReSD: Dense Retrieval for Speculative Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Milan Gritta
Huiyin Xue
Gerasimos Lampouras
RALM
523
1
0
21 Feb 2025
Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders
Weiqiao Shan
Yongqian Li
Yuhao Zhang
Yingfeng Luo
Chen Xu
...
Yaojie Lu
Hao Fei
Hao Yang
Tong Xiao
Jingbo Zhu
AuLLM
440
3
0
21 Feb 2025
Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models
A. Narayan
D. Biderman
Sabri Eyuboglu
Avner May
Scott W. Linderman
James Zou
Christopher Ré
261
12
0
21 Feb 2025
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
International Conference on Learning Representations (ICLR), 2025
Shane Bergsma
Nolan Dey
Gurpreet Gosal
Gavia Gray
Daria Soboleva
Joel Hestness
334
22
0
21 Feb 2025
C2T: A Classifier-Based Tree Construction Method in Speculative Decoding
Feiye Huo
Jianchao Tan
Xunliang Cai
Shengli Sun
191
4
0
20 Feb 2025
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Seanie Lee
Dong Bok Lee
Dominik Wagner
Minki Kang
Haebin Seong
Tobias Bocklet
Juho Lee
Sung Ju Hwang
539
3
0
18 Feb 2025
Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral
António Farinhas
Nuno M. Guerreiro
Sweta Agrawal
Ricardo Rei
André F. T. Martins
339
3
0
18 Feb 2025
Language Models Can Predict Their Own Behavior
Dhananjay Ashok
Jonathan May
AI4TS
ReLM
LRM
426
5
0
18 Feb 2025
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yige Xu
Xu Guo
Zhiwei Zeng
Chunyan Miao
LLMAG
CLL
LRM
487
64
0
17 Feb 2025
Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption
Alireza Nik
Pål Halvorsen
297
1
0
17 Feb 2025
SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer
Zhengyan Sheng
Zhihao Du
Shiliang Zhang
Zhijie Yan
Yexin Yang
Zhenhua Ling
294
6
0
16 Feb 2025
Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization
Bowen Pang
Kai Li
Ruifeng She
Feifan Wang
OffRL
277
2
0
14 Feb 2025
Theoretical Benefit and Limitation of Diffusion Language Model
Guhao Feng
Yihan Geng
Jian Guan
Wei Wu
Liwei Wang
Di He
DiffM
375
1
0
13 Feb 2025
Auditing Prompt Caching in Language Model APIs
Chenchen Gu
Xiang Lisa Li
Rohith Kuditipudi
Percy Liang
Tatsunori Hashimoto
355
5
0
11 Feb 2025
Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding
Liang Luo
Muneeza Azmart
Ang Li
R. Horesh
Mikhail Yurochkin
427
6
0
11 Feb 2025
LANTERN++: Enhancing Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models
Sihwan Park
Doohyuk Jang
Sungyub Kim
Souvik Kundu
Eunho Yang
351
7
0
10 Feb 2025
Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention
Zhendong Zhang
150
0
0
09 Feb 2025
Towards Sustainable NLP: Insights from Benchmarking Inference Energy in Large Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
S. Poddar
Paramita Koley
Janardan Misra
Niloy Ganguly
Saptarshi Ghosh
370
5
0
08 Feb 2025
Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Sukmin Cho
S. Choi
T. Hwang
Jeongyeon Seo
Soyeong Jeong
Huije Lee
Hoyun Song
Jong C. Park
Youngjin Kwon
521
4
0
08 Feb 2025
Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference
Toby Simonds
278
3
0
05 Feb 2025
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Rishabh Tiwari
Haocheng Xi
Aditya Tomar
Coleman Hooper
Sehoon Kim
Maxwell Horton
Mahyar Najibi
Michael W. Mahoney
Kemal Kurniawan
Amir Gholami
MQ
251
9
0
05 Feb 2025
Intelligent Sensing-to-Action for Robust Autonomy at the Edge: Opportunities and Challenges
Design, Automation and Test in Europe (DATE), 2025
A. R. Trivedi
Sina Tayebati
Hemant Kumawat
Nastaran Darabi
Divake Kumar
...
Dinithi Jayasuriya
Nethmi Jayasinghe
Priyadarshini Panda
Saibal Mukhopadhyay
Kaushik Roy
387
5
0
04 Feb 2025
M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
Nikhil Bhendawade
Mahyar Najibi
Devang Naik
Irina Belousova
MoE
453
1
0
04 Feb 2025
Position: AI Scaling: From Up to Down and Out
Yunke Wang
Yanxi Li
Chang Xu
HAI
519
1
0
02 Feb 2025
Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
International Conference on Learning Representations (ICLR), 2025
Gregor Bachmann
Sotiris Anagnostidis
Albert Pumarola
Markos Georgopoulos
A. Sanakoyeu
Yuming Du
Edgar Schönfeld
Ali K. Thabet
Jonas Kohler
ALM
BDL
434
32
0
31 Jan 2025
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Nadav Timor
Jonathan Mamou
Daniel Korat
Moshe Berchansky
Oren Pereg
Gaurav Jain
Roy Schwartz
Moshe Wasserblat
658
9
0
31 Jan 2025
Safeguarding Privacy in Edge Speech Understanding with Tiny Foundation Models
A. Benazir
Felix Xiaozhu Lin
329
2
0
29 Jan 2025
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models
International Conference on Learning Representations (ICLR), 2025
Makoto Shing
Yuichi Inoue
Han Bao
Sho Yokoi
Takuya Akiba
VLM
582
11
0
28 Jan 2025
Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
Nicolas Boizard
Kevin El Haddad
Céline Hudelot
Pierre Colombo
449
27
0
28 Jan 2025
Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Hao Zhang
Felix Stahlberg
Shankar Kumar
KELM
95
1
0
23 Jan 2025
Toyteller: AI-powered Visual Storytelling Through Toy-Playing with Character Symbols
International Conference on Human Factors in Computing Systems (CHI), 2025
John Joon Young Chung
Melissa Roemmele
Max Kreminski
VGen
302
6
0
23 Jan 2025
AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding
Zikun Li
Zhuofu Chen
Yingyi Huang
Xupeng Miao
Zeyu Wang
...
Zhuoming Chen
Sean Lai
Xinhao Cheng
Zhihao Jia
308
6
0
21 Jan 2025
Multi-Agent Collaboration Mechanisms: A Survey of LLMs
Khanh-Tung Tran
Dung Dao
Minh-Duong Nguyen
Quoc-Viet Pham
Barry O'Sullivan
Hoang D. Nguyen
LLMAG
310
252
0
10 Jan 2025
DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
Alexander Polok
Dominik Klement
M. Kocour
Jiangyu Han
Federico Landini
Bolaji Yusuf
Sanjeev Khudanpur
J. Černocký
L. Burget
277
0
0
03 Jan 2025
Towards Sustainable Large Language Model Serving
ACM SIGEnergy Energy Informatics Review (SEIR), 2024
Sophia Nguyen
Beihao Zhou
Yi Ding
Sihang Liu
490
26
0
31 Dec 2024
A novel framework for MCDM based on Z numbers and soft likelihood function
Yuanpeng He
220
0
0
26 Dec 2024
SlimGPT: Layer-wise Structured Pruning for Large Language Models
Neural Information Processing Systems (NeurIPS), 2024
Gui Ling
Ziyang Wang
Yuliang Yan
Qingwen Liu
221
27
0
24 Dec 2024
Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels
Mingcong Song
Xinru Tang
Fengfan Hou
Jing Li
Wei Wei
...
Hongjie Si
Dengyang Jiang
Shouyi Yin
Yang Hu
Guoping Long
172
6
0
24 Dec 2024
SYMPHONY: Improving Memory Management for LLM Inference Workloads
Saurabh Agarwal
Anyong Mao
Aditya Akella
Shivaram Venkataraman
LLMAG
237
3
0
21 Dec 2024
Parallelized Autoregressive Visual Generation
Computer Vision and Pattern Recognition (CVPR), 2024
Yanjie Wang
Shuhuai Ren
Zhijie Lin
Yujin Han
Haoyuan Guo
Zhenheng Yang
Difan Zou
Jiashi Feng
Xihui Liu
VGen
649
36
0
19 Dec 2024
Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu
Jinyu Chen
Peirong Zheng
Xiaoquan Yi
Tianyi Tian
...
Quan Wan
Yining Qi
Yunfeng Fan
Qinliang Su
Xuemin Shen
AI4CE
475
5
0
18 Dec 2024