2211.17192
Cited By
Fast Inference from Transformers via Speculative Decoding
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
LRM
Papers citing
"Fast Inference from Transformers via Speculative Decoding"
50 / 477 papers shown
Title
Multi-Candidate Speculative Decoding
Sen Yang
Shujian Huang
Xinyu Dai
Jiajun Chen
BDL
23
15
0
12 Jan 2024
Distilling Vision-Language Models on Millions of Videos
Yue Zhao
Long Zhao
Xingyi Zhou
Jialin Wu
Chun-Te Chu
...
Hartwig Adam
Ting Liu
Boqing Gong
Philipp Krahenbuhl
Liangzhe Yuan
VLM
21
13
0
11 Jan 2024
Pheme: Efficient and Conversational Speech Generation
Paweł Budzianowski
Taras Sereda
Tomasz Cichy
Ivan Vulić
21
7
0
05 Jan 2024
Training and Serving System of Foundation Models: A Comprehensive Survey
Jiahang Zhou
Yanyu Chen
Zicong Hong
Wuhui Chen
Yue Yu
Tao Zhang
Hui Wang
Chuan-fu Zhang
Zibin Zheng
ALM
17
5
0
05 Jan 2024
IoT in the Era of Generative AI: Vision and Challenges
Xin Wang
Zhongwei Wan
Arvin Hekmati
M. Zong
Samiul Alam
Mi Zhang
Bhaskar Krishnamachari
24
15
0
03 Jan 2024
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
Xupeng Miao
Gabriele Oliaro
Zhihao Zhang
Xinhao Cheng
Hongyi Jin
Tianqi Chen
Zhihao Jia
53
75
0
23 Dec 2023
Structure-Aware Path Inference for Neural Finite State Transducers
Weiting Tan
Chu-cheng Lin
Jason Eisner
8
0
0
21 Dec 2023
Cascade Speculative Drafting for Even Faster LLM Inference
Ziyi Chen
Xiaocong Yang
Jiacheng Lin
Chenkai Sun
Kevin Chen-Chuan Chang
Jie Huang
LRM
19
46
0
18 Dec 2023
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Peiyi Wang
Lei Li
Zhihong Shao
R. X. Xu
Damai Dai
Yifei Li
Deli Chen
Y. Wu
Zhifang Sui
AIMat
LRM
ALM
31
258
0
14 Dec 2023
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Keivan Alizadeh-Vahid
Iman Mirzadeh
Dmitry Belenko
Karen Khatamifard
Minsik Cho
C. C. D. Mundo
Mohammad Rastegari
Mehrdad Farajtabar
70
104
0
12 Dec 2023
A Review of Hybrid and Ensemble in Deep Learning for Natural Language Processing
Jianguo Jia
Wen-Chieh Liang
Youzhi Liang
VLM
10
16
0
09 Dec 2023
Stateful Large Language Model Serving with Pensieve
Lingfan Yu
Jinyang Li
RALM
KELM
LLMAG
26
11
0
09 Dec 2023
Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
Yinwei Dai
Rui Pan
Anand Iyer
Kai Li
Ravi Netravali
13
7
0
08 Dec 2023
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
Yanxi Chen
Xuchen Pan
Yaliang Li
Bolin Ding
Jingren Zhou
LRM
12
31
0
08 Dec 2023
An LLM Compiler for Parallel Function Calling
Sehoon Kim
Suhong Moon
Ryan Tabrizi
Nicholas Lee
Michael W. Mahoney
Kurt Keutzer
A. Gholami
LRM
16
58
0
07 Dec 2023
TextGenSHAP: Scalable Post-hoc Explanations in Text Generation with Long Documents
James Enouen
Hootan Nakhost
Sayna Ebrahimi
Sercan Ö. Arik
Yan Liu
Tomas Pfister
33
4
0
03 Dec 2023
ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?
Hailin Chen
Fangkai Jiao
Xingxuan Li
Chengwei Qin
Mathieu Ravaut
Ruochen Zhao
Caiming Xiong
Shafiq R. Joty
ELM
CLL
AI4MH
LRM
ALM
77
27
0
28 Nov 2023
PaSS: Parallel Speculative Sampling
Giovanni Monea
Armand Joulin
Edouard Grave
MoE
6
30
0
22 Nov 2023
HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
Youhe Jiang
Ran Yan
Xiaozhe Yao
Yang Zhou
Beidi Chen
Binhang Yuan
SyDa
8
10
0
20 Nov 2023
Speculative Contrastive Decoding
Hongyi Yuan
Keming Lu
Fei Huang
Zheng Yuan
Chang Zhou
28
5
0
15 Nov 2023
Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
Hongxuan Zhang
Zhining Liu
Yao Zhao
Jiaqi Zheng
Chenyi Zhuang
Jinjie Gu
Guihai Chen
LRM
MLLM
10
1
0
14 Nov 2023
REST: Retrieval-Based Speculative Decoding
Zhenyu He
Zexuan Zhong
Tianle Cai
Jason D. Lee
Di He
RALM
6
74
0
14 Nov 2023
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
Zihao Wang
Shaofei Cai
Anji Liu
Yonggang Jin
Jinbing Hou
...
Zhaofeng He
Zilong Zheng
Yaodong Yang
Xiaojian Ma
Yitao Liang
LLMAG
LM&Ro
17
95
0
10 Nov 2023
GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values
Farnoosh Javadi
Walid Ahmed
Habib Hajimolahoseini
Foozhan Ataiefard
Mohammad Hassanpour
Saina Asani
Austin Wen
Omar Mohamed Awad
Kangling Liu
Yang Liu
VLM
23
7
0
06 Nov 2023
Improving Machine Translation with Large Language Models: A Preliminary Study with Cooperative Decoding
Jiali Zeng
Fandong Meng
Yongjing Yin
Jie Zhou
21
10
0
06 Nov 2023
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization
Bjorn Deiseroth
Max Meuer
Nikolas Gritsch
C. Eichenberg
P. Schramowski
Matthias Aßenmacher
Kristian Kersting
14
3
0
02 Nov 2023
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
Sanchit Gandhi
Patrick von Platen
Alexander M. Rush
VLM
8
49
0
01 Nov 2023
The Synergy of Speculative Decoding and Batching in Serving Large Language Models
Qidong Su
Christina Giannoula
Gennady Pekhimenko
11
10
0
28 Oct 2023
Punica: Multi-Tenant LoRA Serving
Lequn Chen
Zihao Ye
Yongji Wu
Danyang Zhuo
Luis Ceze
Arvind Krishnamurthy
31
34
0
28 Oct 2023
Controlled Decoding from Language Models
Sidharth Mudgal
Jong Lee
H. Ganapathy
Yaguang Li
Tao Wang
...
Michael Collins
Trevor Strohman
Jilin Chen
Alex Beutel
Ahmad Beirami
32
69
0
25 Oct 2023
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Elias Frantar
Dan Alistarh
MQ
MoE
19
24
0
25 Oct 2023
SpecTr: Fast Speculative Decoding via Optimal Transport
Ziteng Sun
A. Suresh
Jae Hun Ro
Ahmad Beirami
Himanshu Jain
Felix X. Yu
40
64
0
23 Oct 2023
Large Search Model: Redefining Search Stack in the Era of LLMs
Liang Wang
Nan Yang
Xiaolong Huang
Linjun Yang
Rangan Majumder
Furu Wei
LRM
KELM
20
13
0
23 Oct 2023
An Emulator for Fine-Tuning Large Language Models using Small Language Models
Eric Mitchell
Rafael Rafailov
Archit Sharma
Chelsea Finn
Christopher D. Manning
ALM
27
51
0
19 Oct 2023
SPEED: Speculative Pipelined Execution for Efficient Decoding
Coleman Hooper
Sehoon Kim
Hiva Mohammadzadeh
Hasan Genç
Kurt Keutzer
A. Gholami
Y. Shao
22
34
0
18 Oct 2023
BitNet: Scaling 1-bit Transformers for Large Language Models
Hongyu Wang
Shuming Ma
Li Dong
Shaohan Huang
Huaijie Wang
Lingxiao Ma
Fan Yang
Ruiping Wang
Yi Wu
Furu Wei
MQ
12
95
0
17 Oct 2023
Enhanced Transformer Architecture for Natural Language Processing
Woohyeon Moon
Taeyoung Kim
Bumgeun Park
Dongsoo Har
16
0
0
17 Oct 2023
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
Saleh Ashkboos
Ilia Markov
Elias Frantar
Tingxuan Zhong
Xincheng Wang
Jie Ren
Torsten Hoefler
Dan Alistarh
MQ
SyDa
117
21
0
13 Oct 2023
Tree-Planner: Efficient Close-loop Task Planning with Large Language Models
Mengkang Hu
Yao Mu
Xinmiao Yu
Mingyu Ding
Shiguang Wu
Wenqi Shao
Qiguang Chen
Bin Wang
Yu Qiao
Ping Luo
LLMAG
39
33
0
12 Oct 2023
DistillSpec: Improving Speculative Decoding via Knowledge Distillation
Yongchao Zhou
Kaifeng Lyu
A. S. Rawat
A. Menon
Afshin Rostamizadeh
Sanjiv Kumar
Jean-François Kagy
Rishabh Agarwal
42
77
0
12 Oct 2023
MatFormer: Nested Transformer for Elastic Inference
Devvrit
Sneha Kudugunta
Aditya Kusupati
Tim Dettmers
Kaifeng Chen
...
Yulia Tsvetkov
Hannaneh Hajishirzi
Sham Kakade
Ali Farhadi
Prateek Jain
26
22
0
11 Oct 2023
CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
Yuhan Liu
Hanchen Li
Yihua Cheng
Siddhant Ray
Yuyang Huang
...
Ganesh Ananthanarayanan
Michael Maire
Henry Hoffmann
Ari Holtzman
Junchen Jiang
50
41
0
11 Oct 2023
Online Speculative Decoding
Xiaoxuan Liu
Lanxiang Hu
Peter Bailis
Alvin Cheung
Zhijie Deng
Ion Stoica
Hao Zhang
15
49
0
11 Oct 2023
CoQuest: Exploring Research Question Co-Creation with an LLM-based Agent
Yiren Liu
Si Chen
Haocong Cheng
Mengxia Yu
Xiao Ran
Andrew Mo
Yiliu Tang
Yun Huang
LLMAG
28
44
0
09 Oct 2023
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Sangmin Bae
Jongwoo Ko
Hwanjun Song
SeYoung Yun
17
53
0
09 Oct 2023
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Iman Mirzadeh
Keivan Alizadeh-Vahid
Sachin Mehta
C. C. D. Mundo
Oncel Tuzel
Golnoosh Samei
Mohammad Rastegari
Mehrdad Farajtabar
118
58
0
06 Oct 2023
DirectGPT: A Direct Manipulation Interface to Interact with Large Language Models
Damien Masson
Sylvain Malacria
Géry Casiez
Daniel Vogel
AI4CE
KELM
MLLM
23
34
0
05 Oct 2023
Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning
Murong Yue
Jie Zhao
Min Zhang
Liang Du
Ziyu Yao
LRM
22
54
0
04 Oct 2023
Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training
Xidong Feng
Ziyu Wan
Muning Wen
Stephen Marcus McAleer
Ying Wen
Weinan Zhang
Jun Wang
LRM
AI4CE
22
147
0
29 Sep 2023
Pushing Large Language Models to the 6G Edge: Vision, Challenges, and Opportunities
Zhengyi Lin
Guanqiao Qu
Qiyuan Chen
Randy Sarayar
Zhe Chen
Kaibin Huang
10
89
0
28 Sep 2023