2211.17192
Fast Inference from Transformers via Speculative Decoding
International Conference on Machine Learning (ICML), 2022
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
LRM
Papers citing
"Fast Inference from Transformers via Speculative Decoding"
50 / 763 papers shown
MatMamba: A Matryoshka State Space Model
Abhinav Shukla
Sai H. Vemprala
Aditya Kusupati
Ashish Kapoor
Mamba
252
3
0
09 Oct 2024
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
International Conference on Learning Representations (ICLR), 2024
Heming Xia
Yongqi Li
Jun Zhang
Cunxiao Du
Wenjie Li
LRM
334
38
0
09 Oct 2024
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Xinyi Zeng
Yuying Shang
Yutao Zhu
Jingyuan Zhang
Yu Tian
AAML
1.1K
13
0
09 Oct 2024
A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models
IEEE Circuits and Systems Magazine (IEEE CSM), 2024
Cong Guo
Feng Cheng
Zhixu Du
James Kiessling
Jonathan Ku
...
Qilin Zheng
Guanglei Zhou
Hai Li
Yiran Chen
226
19
0
08 Oct 2024
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
Zilin Xiao
Hongming Zhang
Tao Ge
Siru Ouyang
Vicente Ordonez
Dong Yu
251
13
0
08 Oct 2024
Efficient Inference for Large Language Model-based Generative Recommendation
International Conference on Learning Representations (ICLR), 2024
Xinyu Lin
Chaoqun Yang
Wenjie Wang
Yongqi Li
Cunxiao Du
Fuli Feng
See-Kiong Ng
Tat-Seng Chua
371
13
0
07 Oct 2024
Rational Metareasoning for Large Language Models
C. Nicolò De Sabbata
T. Sumers
Badr AlKhamissi
Antoine Bosselut
Thomas Griffiths
LRM
ReLM
445
9
0
07 Oct 2024
RevMUX: Data Multiplexing with Reversible Adapters for Efficient LLM Batch Inference
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Yige Xu
Xu Guo
Zhiwei Zeng
Chunyan Miao
189
0
0
06 Oct 2024
Geometric Collaborative Filtering with Convergence
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Hisham Husain
Julien Monteil
FedML
453
21
0
04 Oct 2024
Mixture of Attentions For Speculative Decoding
International Conference on Learning Representations (ICLR), 2024
Matthieu Zimmer
Milan Gritta
Gerasimos Lampouras
Haitham Bou Ammar
Jun Wang
342
12
0
04 Oct 2024
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
Aurick Qiao
Z. Yao
Samyam Rajbhandari
Yuxiong He
VLM
346
6
0
04 Oct 2024
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding
International Conference on Learning Representations (ICLR), 2024
Doohyuk Jang
Sihwan Park
J. Yang
Yeonsung Jung
Jihun Yun
Souvik Kundu
Sung-Yub Kim
Eunho Yang
473
29
0
04 Oct 2024
Efficiently Deploying LLMs with Controlled Risk
Michael J. Zellinger
Matt Thomson
281
3
0
03 Oct 2024
Better Instruction-Following Through Minimum Bayes Risk
International Conference on Learning Representations (ICLR), 2024
Ian Wu
Patrick Fernandes
Amanda Bertsch
Seungone Kim
Sina Pakazad
Graham Neubig
594
15
0
03 Oct 2024
Selective Attention Improves Transformer
International Conference on Learning Representations (ICLR), 2024
Yaniv Leviathan
Matan Kalman
Yossi Matias
359
20
0
03 Oct 2024
Inductive Generative Recommendation via Retrieval-based Speculation
Yijie Ding
Jiacheng Li
Julian McAuley
Yupeng Hou
138
12
0
03 Oct 2024
Interpretable Contrastive Monte Carlo Tree Search Reasoning
Zitian Gao
Boye Niu
Xuzheng He
Haotian Xu
Hongzhang Liu
Aiwei Liu
Xuming Hu
Lijie Wen
LRM
479
59
0
02 Oct 2024
Integrative Decoding: Improve Factuality via Implicit Self-consistency
Yi Cheng
Xiao Liang
Yeyun Gong
Wen Xiao
Song Wang
...
Wenjie Li
Jian Jiao
Qi Chen
Peng Cheng
Wayne Xiong
HILM
509
6
0
02 Oct 2024
Speculative Coreset Selection for Task-Specific Fine-tuning
Xiaoyu Zhang
Juan Zhai
Shiqing Ma
Chao Shen
Tianlin Li
Weipeng Jiang
Yang Liu
211
9
0
02 Oct 2024
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
International Conference on Learning Representations (ICLR), 2024
Yao Teng
Han Shi
Xian Liu
Xuefei Ning
Guohao Dai
Yu Wang
Zhenguo Li
Xihui Liu
384
42
0
02 Oct 2024
Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Michael R. Metel
Peng Lu
Boxing Chen
Mehdi Rezagholizadeh
I. Kobyzev
170
8
0
01 Oct 2024
Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
Keivan Alizadeh
Iman Mirzadeh
Hooman Shahrokhi
Dmitry Belenko
Frank Sun
Minsik Cho
Mohammad Hossein Sekhavat
Moin Nabi
Mehrdad Farajtabar
MoE
279
2
0
01 Oct 2024
Approximately Aligned Decoding
Daniel Melcer
Sujan Kumar Gonugondla
Pramuditha Perera
Haifeng Qian
Wen-Hao Chiang
Yanjun Wang
Nihal Jain
Pranav Garg
Xiaofei Ma
Hao Ding
301
2
0
01 Oct 2024
Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface
Qingfeng Lan
Mengting Wan
Shashank Vadrevu
Ryan Nadel
Yongfeng Zhang
Chi Wang
LLMAG
195
8
0
30 Sep 2024
Characterizing and Efficiently Accelerating Multimodal Generation Model Inference
Yejin Lee
Anna Y. Sun
Basil Hosmer
Bilge Acun
Can Balioglu
...
Ram Pasunuru
Scott Yih
Sravya Popuri
Xing Liu
Carole-Jean Wu
475
5
0
30 Sep 2024
The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems
IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024
Linke Song
Zixuan Pang
Wenhao Wang
Zihao Wang
XiaoFeng Wang
H. G. Chen
Wei Song
Yier Jin
Dan Meng
Rui Hou
617
18
0
30 Sep 2024
Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference
Zongyue Qin
Zifan He
Neha Prakriya
Jason Cong
Yizhou Sun
284
7
0
25 Sep 2024
Accumulator-Aware Post-Training Quantization for Large Language Models
Ian Colbert
Giuseppe Franco
Fabian Grob
Jinjie Zhang
Rayan Saab
MQ
277
4
0
25 Sep 2024
Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR
Yael Segal-Feldman
Aviv Shamsian
Aviv Navon
Gill Hetz
Joseph Keshet
178
6
0
24 Sep 2024
Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
Agniv Sharma
Jonas Geiping
217
2
0
23 Sep 2024
CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Junlin Lv
Yuan Feng
Xike Xie
Xin Jia
Qirong Peng
Guiming Xie
287
5
0
19 Sep 2024
Improving Multi-candidate Speculative Decoding
Xiaofan Lu
Yixiao Zeng
Feiyang Ma
Zixu Yu
Marco Levorato
106
4
0
16 Sep 2024
Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance
Adarsh MS
Jithin VG
Ditto PS
118
3
0
15 Sep 2024
What is the Role of Small Models in the LLM Era: A Survey
Lihu Chen
Gaël Varoquaux
ALM
784
55
0
10 Sep 2024
Recall: Empowering Multimodal Embedding for Edge Devices
Dongqi Cai
Shangguang Wang
Chen Peng
Zeling Zhang
Mengwei Xu
180
4
0
09 Sep 2024
An overview of domain-specific foundation model: key technologies, applications and challenges
Science China Information Sciences (Sci. China Inf. Sci.), 2024
Haolong Chen
Hanzhi Chen
Zijian Zhao
Kaifeng Han
Guangxu Zhu
Yichen Zhao
Ying Du
Wei Xu
Qingjiang Shi
ALM
VLM
492
19
0
06 Sep 2024
CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Junhui He
Shangyu Wu
Weidong Wen
Chun Jason Xue
Qingan Li
99
8
0
02 Sep 2024
Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
Oscar Brown
Zhengjie Wang
Andrea Do
Nikhil Mathew
Cheng Yu
293
11
0
30 Aug 2024
Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling
International Conference on Learning Representations (ICLR), 2024
Yuejiang Liu
Jubayer Ibn Hamid
Annie Xie
Yoonho Lee
Maximilian Du
Chelsea Finn
OffRL
371
5
0
30 Aug 2024
Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation
Lujun Gui
Bin Xiao
Lei Su
Weipeng Chen
190
7
0
28 Aug 2024
Learning Harmonized Representations for Speculative Sampling
International Conference on Learning Representations (ICLR), 2024
Lefan Zhang
Xiaodan Wang
Yanhua Huang
Ruiwen Xu
314
0
0
28 Aug 2024
NanoFlow: Towards Optimal Large Language Model Serving Throughput
Kan Zhu
Yilong Zhao
Liangyu Zhao
Gefei Zuo
Yile Gu
...
Keisuke Kamahori
Chien-Yu Lin
Stephanie Wang
Arvind Krishnamurthy
Baris Kasikci
229
71
0
22 Aug 2024
Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Chenhan Yuan
Fei Huang
Ru Peng
Keming Lu
Bowen Yu
Chang Zhou
Jingren Zhou
KELM
217
0
0
20 Aug 2024
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
International Conference on Learning Representations (ICLR), 2024
Jian Chen
Vashisth Tiwari
Ranajoy Sadhukhan
Zhuoming Chen
Jinyuan Shi
Ian En-Hsu Yen
Avner May
Tianqi Chen
Beidi Chen
LRM
677
60
0
20 Aug 2024
Parallel Sampling via Counting
Symposium on the Theory of Computing (STOC), 2024
Nima Anari
Ruiquan Gao
Aviad Rubinstein
184
9
0
18 Aug 2024
Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Jerry Huang
Prasanna Parthasarathi
Mehdi Rezagholizadeh
Sarath Chandar
233
5
0
16 Aug 2024
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Xianzhen Luo
Yixuan Wang
Qingfu Zhu
Zhiming Zhang
Xuanyu Zhang
Qing Yang
Dongliang Xu
454
24
0
16 Aug 2024
P/D-Serve: Serving Disaggregated Large Language Model at Scale
Yibo Jin
Tao Wang
Huimin Lin
Mingyang Song
Peiyang Li
...
Haoliang Cheng
Xiaojing Li
Jiandong Ding
Hefei Guo
Zhengyong Zhang
MoE
215
29
0
15 Aug 2024
KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning
International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2024
Kaiqi Zhang
Jing Zhao
Rui Chen
315
5
0
15 Aug 2024
Coupling without Communication and Drafter-Invariant Speculative Decoding
International Symposium on Information Theory (ISIT), 2024
Majid Daliri
Christopher Musco
A. Suresh
398
2
0
15 Aug 2024