Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2305.09781
Cited By
SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification
16 May 2023
Xupeng Miao
Gabriele Oliaro
Zhihao Zhang
Xinhao Cheng
Zeyu Wang
Zhengxin Zhang
Rae Ying Yee Wong
Alan Zhu
Lijie Yang
Xiaoxiang Shi
Chunan Shi
Zhuoming Chen
Daiyaan Arfeen
Reyna Abhyankar
Zhihao Jia
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification"
26 / 26 papers shown
Title
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
Hang Wu
Jianian Zhu
Y. Li
Haojie Wang
Biao Hou
Jidong Zhai
33
0
0
12 May 2025
PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
Bradley McDanel
S. Zhang
Y. Hu
Zining Liu
MoE
81
0
0
02 May 2025
Taming the Titans: A Survey of Efficient LLM Inference Serving
Ranran Zhen
J. Li
Yixin Ji
Z. Yang
Tong Liu
Qingrong Xia
Xinyu Duan
Z. Wang
Baoxing Huai
M. Zhang
LLMAG
77
0
0
28 Apr 2025
Token Level Routing Inference System for Edge Devices
Jianshu She
Wenhao Zheng
Zhengzhong Liu
Hongyi Wang
Eric P. Xing
Huaxiu Yao
Qirong Ho
36
0
0
10 Apr 2025
Collaborative Speculative Inference for Efficient LLM Inference Serving
Luyao Gao
Jianchun Liu
Hongli Xu
Xichong Zhang
Yunming Liao
Liusheng Huang
41
0
0
13 Mar 2025
CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter
Yepeng Weng
Dianwen Mei
Huishi Qiu
Xujie Chen
Li Liu
Jiang Tian
Zhongchao Shi
44
0
0
24 Feb 2025
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models
Makoto Shing
Kou Misaki
Han Bao
Sho Yokoi
Takuya Akiba
VLM
57
1
0
28 Jan 2025
iServe: An Intent-based Serving System for LLMs
Dimitrios Liakopoulos
Tianrui Hu
Prasoon Sinha
N. Yadwadkar
VLM
116
0
0
08 Jan 2025
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye
Lequn Chen
Ruihang Lai
Wuwei Lin
Yineng Zhang
...
Tianqi Chen
Baris Kasikci
Vinod Grover
Arvind Krishnamurthy
Luis Ceze
65
21
0
02 Jan 2025
ProMoE: Fast MoE-based LLM Serving using Proactive Caching
Xiaoniu Song
Zihang Zhong
Rong Chen
Haibo Chen
MoE
58
4
0
29 Oct 2024
Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition
Artem Basharin
Andrei Chertkov
Ivan V. Oseledets
36
1
0
23 Oct 2024
QSpec: Speculative Decoding with Complementary Quantization Schemes
Juntao Zhao
Wenhao Lu
Sheng Wang
Lingpeng Kong
Chuan Wu
MQ
62
5
0
15 Oct 2024
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
Heming Xia
Yongqi Li
Jun Zhang
Cunxiao Du
Wenjie Li
LRM
46
5
0
09 Oct 2024
Efficient Inference for Large Language Model-based Generative Recommendation
Xinyu Lin
Chaoqun Yang
Wenjie Wang
Yongqi Li
Cunxiao Du
Fuli Feng
See-Kiong Ng
Tat-Seng Chua
65
4
0
07 Oct 2024
Geometric Collaborative Filtering with Convergence
Hisham Husain
Julien Monteil
FedML
25
5
0
04 Oct 2024
Mixture of Attentions For Speculative Decoding
Matthieu Zimmer
Milan Gritta
Gerasimos Lampouras
Haitham Bou Ammar
Jun Wang
74
4
0
04 Oct 2024
Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding
Yunjia Xi
Hangyu Wang
Bo Chen
Jianghao Lin
Menghui Zhu
W. Liu
Ruiming Tang
Zhewei Wei
W. Zhang
Yong Yu
OffRL
90
4
0
11 Aug 2024
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
Zilong Wang
Zifeng Wang
Long Le
Huaixiu Steven Zheng
Swaroop Mishra
...
Anush Mattapalli
Ankur Taly
Jingbo Shang
Chen-Yu Lee
Tomas Pfister
RALM
75
31
0
11 Jul 2024
Teola: Towards End-to-End Optimization of LLM-based Applications
Xin Tan
Yimin Jiang
Yitao Yang
Hong-Yu Xu
57
5
0
29 Jun 2024
On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion
Chenghao Fan
Zhenyi Lu
Wei Wei
Jie Tian
Xiaoye Qu
Dangyang Chen
Yu Cheng
MoMe
44
5
0
17 Jun 2024
Preble: Efficient Distributed Prompt Scheduling for LLM Serving
Vikranth Srivatsa
Zijian He
Reyna Abhyankar
Dongming Li
Yiying Zhang
44
17
0
08 May 2024
Instruction Tuning with GPT-4
Baolin Peng
Chunyuan Li
Pengcheng He
Michel Galley
Jianfeng Gao
SyDa
ALM
LM&MA
157
580
0
06 Apr 2023
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng
Lianmin Zheng
Binhang Yuan
Zhuohan Li
Max Ryabinin
...
Joseph E. Gonzalez
Percy Liang
Christopher Ré
Ion Stoica
Ce Zhang
144
366
0
13 Mar 2023
Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks
Itay Hubara
Brian Chmiel
Moshe Island
Ron Banner
S. Naor
Daniel Soudry
44
110
0
16 Feb 2021
What Makes Good In-Context Examples for GPT-
3
3
3
?
Jiachang Liu
Dinghan Shen
Yizhe Zhang
Bill Dolan
Lawrence Carin
Weizhu Chen
AAML
RALM
275
1,312
0
17 Jan 2021
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi
M. Patwary
Raul Puri
P. LeGresley
Jared Casper
Bryan Catanzaro
MoE
243
1,817
0
17 Sep 2019
1