ResearchTrend.AI

© 2026 ResearchTrend.AI, All rights reserved.

arXiv:2211.17192 · Cited By
Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2023
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
    LRM
arXiv (abs) · PDF · HTML · HuggingFace (9 upvotes)
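For context on the cited paper: its core loop has a cheap draft model propose γ tokens autoregressively, which the target model then verifies, keeping the longest agreeing prefix plus one token from the target itself. The sketch below illustrates only that control flow; both "models" are hypothetical deterministic toy functions (the real algorithm samples from model distributions and uses a probabilistic accept/reject rule, with all γ verifications computed in one parallel transformer pass):

```python
# Toy sketch of the speculative decoding loop from Leviathan et al.
# (arXiv:2211.17192). Deterministic stand-ins replace real models so the
# control flow is self-contained; this is NOT the paper's sampling rule.

def target_model(prefix):
    # Hypothetical stand-in for the large model: next token = digit sum mod 10.
    return sum(prefix) % 10

def draft_model(prefix):
    # Hypothetical cheap draft: agrees with the target except when the
    # prefix sum is divisible by 7 (a stand-in for occasional mismatch).
    t = sum(prefix) % 10
    return (t + 1) % 10 if sum(prefix) % 7 == 0 else t

def speculative_step(prefix, gamma=4):
    """One step: draft gamma tokens, verify with the target, keep the
    accepted prefix, then append one token from the target model."""
    # 1. Draft gamma tokens autoregressively with the cheap model.
    draft, ctx = [], list(prefix)
    for _ in range(gamma):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)
    # 2. Verify: accept draft tokens until the first disagreement.
    ctx = list(prefix)
    for tok in draft:
        if target_model(ctx) != tok:
            break
        ctx.append(tok)
    # 3. Always emit the target's own token at the mismatch position
    #    (or a bonus token after full acceptance), so every step makes
    #    progress and the output matches target-only greedy decoding.
    ctx.append(target_model(ctx))
    return ctx

def decode(prefix, steps):
    out = list(prefix)
    for _ in range(steps):
        out = speculative_step(out)
    return out
```

Because each accepted or appended token equals the target's greedy prediction at that position, the output is identical to decoding with `target_model` alone, just produced in fewer target-model invocations.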

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown
Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
Zijin Hong
Zheng Yuan
Qinggang Zhang
Hao Chen
Hao-Heng Chen
Feiran Huang
Xiao Huang
853
150
0
12 Jun 2024
OPTune: Efficient Online Preference Tuning
Lichang Chen
Jiuhai Chen
Chenxi Liu
John Kirchenbauer
Davit Soselia
Chen Zhu
Tom Goldstein
Wanrong Zhu
Heng Huang
130
7
0
11 Jun 2024
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
Haoran You
Yichao Fu
Zheng Wang
Amir Yazdanbakhsh
Yingyan Celine Lin
370
8
0
11 Jun 2024
Crayon: Customized On-Device LLM via Instant Adapter Blending and Edge-Server Hybrid Inference
Jihwan Bang
Juntae Lee
Kyuhong Shim
Seunghan Yang
Simyung Chang
190
10
0
11 Jun 2024
Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
Yixin Song
Haotong Xie
Zhengyan Zhang
Bo Wen
Li Ma
Zeyu Mi
Haibo Chen
MoE
395
35
0
10 Jun 2024
Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction
Ke Cheng
Wen Hu
Zhi Wang
Peng Du
Jianguo Li
Sheng Zhang
300
17
0
07 Jun 2024
Proofread: Fixes All Errors with One Tap
Renjie Liu
Yanxiang Zhang
Yun Zhu
Haicheng Sun
Yuanbo Zhang
Michael Xuelin Huang
Shanqing Cai
Lei Meng
Shumin Zhai
ALM
181
6
0
06 Jun 2024
To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Abdul Waheed
Karima Kadaoui
Muhammad Abdul-Mageed
VLM
219
6
0
06 Jun 2024
Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism
Jiahao Liu
Qifan Wang
Jingang Wang
Xunliang Cai
192
30
0
06 Jun 2024
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
Ruslan Svirschevski
Avner May
Zhuoming Chen
Beidi Chen
Zhihao Jia
Max Ryabinin
348
45
0
04 Jun 2024
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Namgyu Ho
Sangmin Bae
Taehyeon Kim
Hyunjik Jo
Yireun Kim
Tal Schuster
Adam Fisch
James Thorne
Se-Young Yun
315
29
0
04 Jun 2024
Diver: Large Language Model Decoding with Span-Level Mutual Information Verification
Jinliang Lu
Chen Wang
Jiajun Zhang
229
4
0
04 Jun 2024
OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Owen Dugan
Donato Manuel Jimenez Beneto
Charlotte Loh
Zhuo Chen
Rumen Dangovski
Marin Soljacic
LRM
341
4
0
04 Jun 2024
GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security
Xuanqing Liu
Luyang Kong
Runhui Wang
Patrick Song
Austin Nevins
Henrik Johnson
Nimish Amlathe
Davor Golac
172
5
0
04 Jun 2024
SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM
Quandong Wang
Yuxuan Yuan
Xiaoyu Yang
Ruike Zhang
Kang Zhao
Wei Liu
Jian Luan
Daniel Povey
Sijin Yu
427
0
0
03 Jun 2024
Achieving Sparse Activation in Small Language Models
Jifeng Song
Kai Huang
Xiangyu Yin
Boyuan Yang
Wei Gao
207
5
0
03 Jun 2024
Decentralized AI: Permissionless LLM Inference on POKT Network
D. Olshansky
Ramiro Rodríguez Colmeiro
Bowen Li
MoE
71
4
0
30 May 2024
S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs
Wei Zhong
Manasa Bharadwaj
366
9
0
30 May 2024
GKT: A Novel Guidance-Based Knowledge Transfer Framework For Efficient Cloud-edge Collaboration LLM Deployment
Yao Yao
Z. Li
Hai Zhao
141
15
0
30 May 2024
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Kaixuan Huang
Xudong Guo
M. Y. Wang
534
45
0
30 May 2024
Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution
Yechen Xu
Xinhao Kong
Ying Zhang
Danyang Zhuo
LLMAG
240
8
0
29 May 2024
Faster Cascades via Speculative Decoding
Harikrishna Narasimhan
Wittawat Jitkrittum
A. S. Rawat
Seungyeon Kim
Neha Gupta
A. Menon
Sanjiv Kumar
LRM
376
21
0
29 May 2024
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Minghan Li
Xilun Chen
Ari Holtzman
Beidi Chen
Jimmy Lin
Anuj Kumar
Xi Lin
RALM BDL
742
21
0
29 May 2024
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
Ethan Shen
Alan Fan
Sarah M Pratt
Jae Sung Park
Matthew Wallingford
Sham Kakade
Ari Holtzman
Ranjay Krishna
Ali Farhadi
Aditya Kusupati
365
4
0
28 May 2024
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
Hao Mark Chen
Wayne Luk
Ka-Fai Cedric Yiu
Rui Li
Konstantin Mishchenko
Stylianos I. Venieris
Hongxiang Fan
290
15
0
28 May 2024
Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection
Yun Zhu
Jia-Chen Gu
Caitlin Sikora
Ho Ko
Yinxiao Liu
...
Lei Shu
Liangchen Luo
Lei Meng
Bang Liu
Jindong Chen
RALM
254
27
0
25 May 2024
Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs
Chenxi Sun
Hongzhi Zhang
Zijia Lin
Jingyuan Zhang
Fuzheng Zhang
...
Bin Chen
Chengru Song
Chen Zhang
Kun Gai
Deyi Xiong
171
2
0
24 May 2024
A Declarative System for Optimizing AI Workloads
Chunwei Liu
Matthew Russo
Michael Cafarella
Lei Cao
Peter Baille Chen
Zui Chen
Michael Franklin
Tim Kraska
Samuel Madden
Gerardo Vitagliano
264
45
0
23 May 2024
Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs
Qingyuan Li
Ran Meng
Yiduo Li
Bo Zhang
Yifan Lu
Yerui Sun
Lin Ma
Yuchen Xie
MQ
238
0
0
23 May 2024
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference
International Conference on Learning Representations (ICLR), 2024
Nadav Timor
Jonathan Mamou
Daniel Korat
Moshe Berchansky
Oren Pereg
Moshe Wasserblat
Tomer Galanti
Michal Gordon
David Harel
LRM
266
1
0
23 May 2024
Modeling Real-Time Interactive Conversations as Timed Diarized Transcripts
Garrett Tanzer
Gustaf Ahdritz
Luke Melas-Kyriazi
345
1
0
21 May 2024
Towards Modular LLMs by Building and Reusing a Library of LoRAs
International Conference on Machine Learning (ICML), 2024
O. Ostapenko
Zhan Su
Edoardo Ponti
Laurent Charlin
Nicolas Le Roux
Matheus Pereira
Lucas Caccia
Alessandro Sordoni
MoMe
258
55
0
18 May 2024
A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models
Mahsa Khoshnoodi
Vinija Jain
Mingye Gao
Malavika Srikanth
Vasu Sharma
OffRL
350
9
0
15 May 2024
Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
Yao Fu
198
38
0
14 May 2024
EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models
Yunsheng Ni
Chuanjian Liu
Yehui Tang
Kai Han
Yunhe Wang
253
1
0
13 May 2024
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
R. Prabhakar
R. Sivaramakrishnan
Darshan Gandhi
Yun Du
Mingran Wang
...
Urmish Thakker
Dawei Huang
Sumti Jairath
Kevin J. Brown
K. Olukotun
MoE
238
27
0
13 May 2024
A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models
Knowledge Discovery and Data Mining (KDD), 2024
Wenqi Fan
Yujuan Ding
Liang-bo Ning
Shijie Wang
Hengyun Li
D. Yin
Tat-Seng Chua
Qing Li
RALM 3DV
625
630
0
10 May 2024
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
International Conference on Machine Learning (ICML), 2024
Minsik Cho
Mohammad Rastegari
Devang Naik
223
11
0
08 May 2024
Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models
Jonathan Mamou
Oren Pereg
Daniel Korat
Moshe Berchansky
Nadav Timor
Moshe Wasserblat
Roy Schwartz
197
11
0
07 May 2024
Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection
Guillem Ramírez
Alexandra Birch
Ivan Titov
327
19
0
03 May 2024
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
Bin Xiao
Chunan Shi
Xiaonan Nie
Fan Yang
Xiangwei Deng
Lei Su
Weipeng Chen
Tengjiao Wang
270
11
0
01 May 2024
Better & Faster Large Language Models via Multi-token Prediction
Fabian Gloeckle
Badr Youbi Idrissi
Baptiste Rozière
David Lopez-Paz
Gabriele Synnaeve
313
222
0
30 Apr 2024
Accelerating Production LLMs with Combined Token/Embedding Speculators
Davis Wertheimer
Joshua Rosenkranz
Thomas Parnell
Sahil Suneja
Pavithra Ranganathan
R. Ganti
Mudhakar Srivatsa
349
6
0
29 Apr 2024
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
Fangcheng Liu
Yehui Tang
Zhenhua Liu
Yunsheng Ni
Kai Han
Yunhe Wang
305
42
0
29 Apr 2024
BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
Jiamin Li
Le Xu
Hong-Yu Xu
Aditya Akella
251
6
0
28 Apr 2024
Reinforcement Retrieval Leveraging Fine-grained Feedback for Fact Checking News Claims with Black-Box LLM
Xuan Zhang
Wei Gao
LRM KELM
284
17
0
26 Apr 2024
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Mostafa Elhoushi
Akshat Shrivastava
Diana Liskovich
Basil Hosmer
Bram Wasti
...
Saurabh Agarwal
Ahmed Roman
Ahmed Aly
Beidi Chen
Carole-Jean Wu
LRM
261
190
0
25 Apr 2024
BASS: Batched Attention-optimized Speculative Sampling
Haifeng Qian
Sujan Kumar Gonugondla
Sungsoo Ha
Mingyue Shang
Sanjay Krishna Gouda
Ramesh Nallapati
Sudipta Sengupta
Xiaofei Ma
Hao Ding
BDL
288
14
0
24 Apr 2024
Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
Chen Zhang
Zhuorui Liu
Dawei Song
LRM
277
8
0
23 Apr 2024
Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
Dujian Ding
Ankur Mallick
Chi Wang
Robert Sim
Subhabrata Mukherjee
Victor Rühle
L. Lakshmanan
Ahmed Hassan Awadallah
410
202
0
22 Apr 2024
Page 12 of 16