Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias
30 November 2022
LRM
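For context on the method being cited: speculative decoding lets a cheap draft model propose several tokens, which the target model then verifies in a single parallel pass, accepting each drafted token x with probability min(1, p(x)/q(x)) and, at the first rejection, resampling from the normalized residual max(0, p − q), so the output distribution of the target model is preserved. The snippet below is a minimal toy sketch of that acceptance rule, not the authors' implementation; `speculative_step`, `draft_probs`, `target_probs`, and the toy vocabulary are hypothetical placeholders.

```python
# Minimal sketch of one speculative-decoding step (toy models, not the
# paper's code): a draft model q proposes gamma tokens, the target model p
# verifies them, and the first rejected token is resampled from the
# normalized residual max(0, p - q).
import numpy as np

def speculative_step(prefix, draft_probs, target_probs, gamma, rng):
    """draft_probs(ctx) / target_probs(ctx) return 1-D arrays over the vocabulary."""
    drafted = []
    ctx = list(prefix)
    for _ in range(gamma):                      # draft gamma tokens autoregressively
        q = draft_probs(ctx)
        tok = rng.choice(len(q), p=q)
        drafted.append((tok, q))
        ctx.append(tok)

    accepted = list(prefix)
    for tok, q in drafted:
        p = target_probs(accepted)              # target distribution at this position
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                # accept the drafted token
        else:
            residual = np.maximum(p - q, 0.0)   # resample from normalized max(0, p - q)
            residual /= residual.sum()
            accepted.append(rng.choice(len(residual), p=residual))
            return accepted                     # stop at the first rejection
    # all gamma drafted tokens accepted: sample one extra token from the target model
    p = target_probs(accepted)
    accepted.append(rng.choice(len(p), p=p))
    return accepted

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V = 8                                       # toy vocabulary size

    def draft_probs(ctx):                       # hypothetical cheap draft model q
        return np.full(V, 1.0 / V)

    def target_probs(ctx):                      # hypothetical target model p
        p = np.arange(1.0, V + 1.0)
        return p / p.sum()

    print(speculative_step([0], draft_probs, target_probs, gamma=4, rng=rng))
```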

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 482 papers shown
Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference
Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun
25 Sep 2024

Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR
Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet
24 Sep 2024

Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
Agniv Sharma, Jonas Geiping
23 Sep 2024

CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs
Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie
19 Sep 2024

Improving Multi-candidate Speculative Decoding
Xiaofan Lu, Yixiao Zeng, Feiyang Ma, Zixu Yu, Marco Levorato
16 Sep 2024

Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance
Adarsh MS, Jithin VG, Ditto PS
15 Sep 2024

What is the Role of Small Models in the LLM Era: A Survey
Lihu Chen, Gaël Varoquaux
ALM
10 Sep 2024

Recall: Empowering Multimodal Embedding for Edge Devices
Dongqi Cai, Shangguang Wang, Chen Peng, Zeling Zhang, Mengwei Xu
09 Sep 2024

CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li
02 Sep 2024

Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
Oscar Brown, Zhengjie Wang, Andrea Do, Nikhil Mathew, Cheng Yu
30 Aug 2024

Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling
Yuejiang Liu, Jubayer Ibn Hamid, Annie Xie, Yoonho Lee, Maximilian Du, Chelsea Finn
OffRL
30 Aug 2024

Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation
Lujun Gui, Bin Xiao, Lei Su, Weipeng Chen
28 Aug 2024

Learning Harmonized Representations for Speculative Sampling
Lefan Zhang, Xiaodan Wang, Yanhua Huang, Ruiwen Xu
28 Aug 2024

NanoFlow: Towards Optimal Large Language Model Serving Throughput
Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, ..., Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
22 Aug 2024

Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model
Chenhan Yuan, Fei Huang, Ru Peng, K. Lu, Bowen Yu, Chang Zhou, Jingren Zhou
KELM
20 Aug 2024

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, Beidi Chen
LRM
20 Aug 2024

Parallel Sampling via Counting
Nima Anari, Ruiquan Gao, Aviad Rubinstein
18 Aug 2024

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che
16 Aug 2024

Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models
Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Sarath Chandar
16 Aug 2024

P/D-Serve: Serving Disaggregated Large Language Model at Scale
Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, ..., Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang
MoE
15 Aug 2024

KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning
Kaiqi Zhang, Jing Zhao, Rui Chen
15 Aug 2024

Coupling without Communication and Drafter-Invariant Speculative Decoding
Majid Daliri, Christopher Musco, A. Suresh
15 Aug 2024

Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
R. Prabhakar, Hengrui Zhang, D. Wentzlaff
14 Aug 2024

PEARL: Parallel Speculative Decoding with Adaptive Draft Length
Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, X. Sun
13 Aug 2024

Post-Training Sparse Attention with Double Sparsity
Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng
11 Aug 2024

Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding
Yunjia Xi, Hangyu Wang, Bo Chen, Jianghao Lin, Menghui Zhu, W. Liu, Ruiming Tang, Zhewei Wei, W. Zhang, Yong Yu
OffRL
11 Aug 2024

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion
Jacob K Christopher, Brian Bartoldson, Tal Ben-Nun, Michael Cardei, B. Kailkhura, Ferdinando Fioretto
DiffM
10 Aug 2024

Retrieval-augmented code completion for local projects using large language models
Marko Hostnik, Marko Robnik-Sikonja
RALM
09 Aug 2024

CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding
Sophia Ho, Jinsol Park, Patrick Wang
08 Aug 2024

StructuredRAG: JSON Response Formatting with Large Language Models
Connor Shorten, Charles Pierse, Thomas Benjamin Smith, Erika Cardenas, Akanksha Sharma, John Trengrove, Bob van Luijt
07 Aug 2024

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations
Leo Donisch, Sigurd Schacht, Carsten Lanquillon
06 Aug 2024

Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding
Bin Xiao, Lujun Gui, Lei Su, Weipeng Chen
01 Aug 2024

ThinK: Thinner Key Cache by Query-Driven Pruning
Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo
30 Jul 2024

Inference acceleration for large language models using "stairs" assisted greedy generation
Domas Grigaliunas, M. Lukoševičius
29 Jul 2024

Graph-Structured Speculative Decoding
Zhuocheng Gong, Jiahao Liu, Ziyue Wang, Pengfei Wu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan
23 Jul 2024

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi
19 Jul 2024

Beyond Next Token Prediction: Patch-Level Training for Large Language Models
Chenze Shao, Fandong Meng, Jie Zhou
17 Jul 2024

Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei
MoE
15 Jul 2024

Accelerating the inference of string generation-based chemical reaction models for industrial applications
Mikhail Andronov, Natalia Andronova, Michael Wand, Jürgen Schmidhuber, Djork-Arné Clevert
AI4CE
12 Jul 2024

Inference Optimization of Foundation Models on AI Accelerators
Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas M. Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, V. Cevher, Yida Wang, George Karypis
12 Jul 2024

Automata-based constraints for language model decoding
Terry Koo, Frederick Liu, Luheng He
AI4CE
11 Jul 2024

Robotic Control via Embodied Chain-of-Thought Reasoning
Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, Sergey Levine
LRM, LM&Ro
11 Jul 2024

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, ..., Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister
RALM
11 Jul 2024

Knowledge boosting during low-latency inference
Vidya Srinivas, Malek Itani, Tuochao Chen, Sefik Emre Eskimez, Takuya Yoshioka, Shyamnath Gollakota
09 Jul 2024

Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems
Amey Agrawal, Anmol Agarwal, Nitin Kedia, Jayashree Mohan, Souvik Kundu, Nipun Kwatra, R. Ramjee, Alexey Tumanov
09 Jul 2024

Mobile Edge Intelligence for Large Language Models: A Contemporary Survey
Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang
09 Jul 2024

Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models
Jinliang Lu, Ziliang Pang, Min Xiao, Yaochen Zhu, Rui Xia, Jiajun Zhang
MoMe
08 Jul 2024

Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations
Bowen Shen, Zheng-Shen Lin, Daren Zha, Wei Liu, Jian Luan, Bin Wang, Weiping Wang
08 Jul 2024

Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models
Bolaji Yusuf, M. Baskar, Andrew Rosenberg, Bhuvana Ramabhadran
05 Jul 2024

Uncertainty-Guided Optimization on Large Language Model Search Trees
Julia Grosse, Ruotian Wu, Ahmad Rashid, Philipp Hennig, Pascal Poupart, Agustinus Kristiadi
04 Jul 2024