Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2211.17192
Cited By
Fast Inference from Transformers via Speculative Decoding
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Fast Inference from Transformers via Speculative Decoding"
50 / 477 papers shown
Title
Dynamic layer selection in decoder-only transformers
Theodore Glavas
Joud Chataoui
Florence Regol
Wassim Jabbour
Antonios Valkanas
Boris N. Oreshkin
Mark J. Coates
AI4CE
24
0
0
26 Oct 2024
Watermarking Large Language Models and the Generated Content: Opportunities and Challenges
Ruisi Zhang
F. Koushanfar
WaLM
36
0
0
24 Oct 2024
AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability
Sudhanshu Agrawal
Wonseok Jeon
Mingu Lee
18
0
0
24 Oct 2024
Multi-Draft Speculative Sampling: Canonical Decomposition and Theoretical Limits
Ashish Khisti
MohammadReza Ebrahimi
Hassan Dbouk
Arash Behboodi
Roland Memisevic
Christos Louizos
23
2
0
23 Oct 2024
Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition
Artem Basharin
Andrei Chertkov
Ivan V. Oseledets
36
1
0
23 Oct 2024
Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models
Qitan Lv
Jie Wang
Hanzhu Chen
Bin Li
Yongdong Zhang
Feng Wu
HILM
17
3
0
19 Oct 2024
MoDification: Mixture of Depths Made Easy
C. Zhang
M. Zhong
Qimeng Wang
Xuantao Lu
Zheyu Ye
...
Yan Gao
Yao Hu
Kehai Chen
Min Zhang
Dawei Song
VLM
MoE
30
2
0
18 Oct 2024
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Jiahao Qiu
Yifu Lu
Yifan Zeng
Jiacheng Guo
Jiayi Geng
Huazheng Wang
Kaixuan Huang
Yue Wu
Mengdi Wang
34
22
0
18 Oct 2024
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
Tan Dat Nguyen
Ji-Hoon Kim
Jeongsoo Choi
Shukjae Choi
Jinseok Park
Younglo Lee
Joon Son Chung
26
0
0
17 Oct 2024
Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement
Yuxuan Liu
Wenyuan Li
Laizhong Cui
Hailiang Yang
OffRL
18
0
0
17 Oct 2024
Learning to Route LLMs with Confidence Tokens
Yu-Neng Chuang
Helen Zhou
Prathusha Kameswara Sarma
Parikshit Gopalan
John Boccio
Sara Bolouki
Xia Hu
25
0
0
17 Oct 2024
DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
Yunfan Xiong
Ruoyu Zhang
Yanzeng Li
Tianhao Wu
Lei Zou
16
5
0
15 Oct 2024
Learning from Imperfect Data: Towards Efficient Knowledge Distillation of Autoregressive Language Models for Text-to-SQL
Qihuang Zhong
Kunfeng Chen
Liang Ding
Juhua Liu
Bo Du
Dacheng Tao
29
0
0
15 Oct 2024
QSpec: Speculative Decoding with Complementary Quantization Schemes
Juntao Zhao
Wenhao Lu
Sheng Wang
Lingpeng Kong
Chuan Wu
MQ
62
5
0
15 Oct 2024
Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling
W. Xu
Rujun Han
Z. Wang
L. Le
Dhruv Madeka
Lei Li
W. Wang
Rishabh Agarwal
Chen-Yu Lee
Tomas Pfister
69
8
0
15 Oct 2024
Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling
Wenze Liu
Le Zhuo
Yi Xin
Sheng Xia
Peng Gao
Xiangyu Yue
29
6
0
14 Oct 2024
Probabilistic Degeneracy Detection for Point-to-Plane Error Minimization
Johan Hatleskog
Kostas Alexis
3DPC
30
0
0
14 Oct 2024
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
Vithursan Thangarasa
Ganesh Venkatesh
Mike Lasby
Nish Sinnadurai
Sean Lie
SyDa
33
0
0
13 Oct 2024
COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement
Yuxi Xie
Anirudh Goyal
Xiaobao Wu
Xunjian Yin
Xiao Xu
Min-Yen Kan
Liangming Pan
William Yang Wang
LRM
45
1
0
12 Oct 2024
QEFT: Quantization for Efficient Fine-Tuning of LLMs
Changhun Lee
Jun-gyu Jin
Younghyun Cho
Eunhyeok Park
MQ
29
1
0
11 Oct 2024
KV Prediction for Improved Time to First Token
Maxwell Horton
Qingqing Cao
Chenfan Sun
Yanzi Jin
Sachin Mehta
Mohammad Rastegari
Moin Nabi
AI4TS
30
1
0
10 Oct 2024
MatMamba: A Matryoshka State Space Model
Abhinav Shukla
Sai H. Vemprala
Aditya Kusupati
Ashish Kapoor
Mamba
28
0
0
09 Oct 2024
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Xinyi Zeng
Yuying Shang
Yutao Zhu
Jingyuan Zhang
Yu Tian
AAML
63
2
0
09 Oct 2024
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
Heming Xia
Yongqi Li
Jun Zhang
Cunxiao Du
Wenjie Li
LRM
44
4
0
09 Oct 2024
A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models
Cong Guo
Feng Cheng
Zhixu Du
James Kiessling
Jonathan Ku
...
Qilin Zheng
Guanglei Zhou
Hai
Li-Wei Li
Yiran Chen
29
5
0
08 Oct 2024
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
Zilin Xiao
Hongming Zhang
Tao Ge
Siru Ouyang
Vicente Ordonez
Dong Yu
39
5
0
08 Oct 2024
Rational Metareasoning for Large Language Models
C. Nicolò De Sabbata
T. Sumers
Thomas L. Griffiths
ReLM
LRM
28
1
0
07 Oct 2024
Efficient Inference for Large Language Model-based Generative Recommendation
Xinyu Lin
Chaoqun Yang
Wenjie Wang
Yongqi Li
Cunxiao Du
Fuli Feng
See-Kiong Ng
Tat-Seng Chua
65
4
0
07 Oct 2024
RevMUX: Data Multiplexing with Reversible Adapters for Efficient LLM Batch Inference
Yige Xu
Xu Guo
Zhiwei Zeng
Chunyan Miao
26
0
0
06 Oct 2024
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
Aurick Qiao
Z. Yao
Samyam Rajbhandari
Yuxiong He
VLM
27
0
0
04 Oct 2024
Geometric Collaborative Filtering with Convergence
Hisham Husain
Julien Monteil
FedML
23
5
0
04 Oct 2024
Mixture of Attentions For Speculative Decoding
Matthieu Zimmer
Milan Gritta
Gerasimos Lampouras
Haitham Bou Ammar
Jun Wang
74
4
0
04 Oct 2024
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding
Doohyuk Jang
Sihwan Park
J. Yang
Yeonsung Jung
Jihun Yun
Souvik Kundu
Sung-Yub Kim
Eunho Yang
40
7
0
04 Oct 2024
Efficiently Deploying LLMs with Controlled Risk
Michael J. Zellinger
Matt Thomson
36
1
0
03 Oct 2024
Better Instruction-Following Through Minimum Bayes Risk
Ian Wu
Patrick Fernandes
Amanda Bertsch
Seungone Kim
Sina Pakazad
Graham Neubig
48
9
0
03 Oct 2024
Selective Attention Improves Transformer
Yaniv Leviathan
Matan Kalman
Yossi Matias
49
8
0
03 Oct 2024
Interpretable Contrastive Monte Carlo Tree Search Reasoning
Zitian Gao
Boye Niu
Xuzheng He
Haotian Xu
Hongzhang Liu
Aiwei Liu
Xuming Hu
Lijie Wen
LRM
50
24
0
02 Oct 2024
Integrative Decoding: Improve Factuality via Implicit Self-consistency
Yi Cheng
Xiao Liang
Yeyun Gong
Wen Xiao
Song Wang
...
Wenjie Li
Jian Jiao
Qi Chen
Peng Cheng
Wayne Xiong
HILM
52
1
0
02 Oct 2024
Speculative Coreset Selection for Task-Specific Fine-tuning
Xiaoyu Zhang
Juan Zhai
Shiqing Ma
Chao Shen
Tianlin Li
Weipeng Jiang
Yang Liu
30
1
0
02 Oct 2024
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
Yao Teng
Han Shi
Xian Liu
Xuefei Ning
Guohao Dai
Yu Wang
Zhenguo Li
Xihui Liu
48
10
0
02 Oct 2024
Approximately Aligned Decoding
Daniel Melcer
Sujan Kumar Gonugondla
Pramuditha Perera
Haifeng Qian
Wen-Hao Chiang
Yanjun Wang
Nihal Jain
Pranav Garg
Xiaofei Ma
Anoop Deoras
31
0
0
01 Oct 2024
Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity
Michael R. Metel
Peng Lu
Boxing Chen
Mehdi Rezagholizadeh
I. Kobyzev
27
3
0
01 Oct 2024
Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
Keivan Alizadeh
Iman Mirzadeh
Hooman Shahrokhi
Dmitry Belenko
Frank Sun
Minsik Cho
Mohammad Hossein Sekhavat
Moin Nabi
Mehrdad Farajtabar
MoE
24
1
0
01 Oct 2024
Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface
Wenyue Hua
Mengting Wan
Shashank Vadrevu
Ryan Nadel
Yongfeng Zhang
Chi Wang
LLMAG
24
1
0
30 Sep 2024
The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems
Linke Song
Zixuan Pang
Wenhao Wang
Zihao Wang
XiaoFeng Wang
Hongbo Chen
Wei Song
Yier Jin
Dan Meng
Rui Hou
43
7
0
30 Sep 2024
Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference
Zongyue Qin
Zifan He
Neha Prakriya
Jason Cong
Yizhou Sun
15
4
0
25 Sep 2024
Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR
Yael Segal-Feldman
Aviv Shamsian
Aviv Navon
Gill Hetz
Joseph Keshet
22
1
0
24 Sep 2024
Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
Agniv Sharma
Jonas Geiping
14
0
0
23 Sep 2024
CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs
Junlin Lv
Yuan Feng
Xike Xie
Xin Jia
Qirong Peng
Guiming Xie
21
3
0
19 Sep 2024
Improving Multi-candidate Speculative Decoding
Xiaofan Lu
Yixiao Zeng
Feiyang Ma
Zixu Yu
Marco Levorato
26
0
0
16 Sep 2024
Previous
1
2
3
4
5
...
8
9
10
Next