Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192)
Yaniv Leviathan, Matan Kalman, Yossi Matias
LRM
30 November 2022
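For context, the cited paper's core idea is that a small draft model proposes several tokens ahead and the large target model verifies them in a single pass, accepting the longest agreeing prefix. The sketch below is a toy greedy variant with hypothetical stand-in models, not the authors' implementation; the paper's actual method uses speculative sampling to provably preserve the target model's output distribution.

```python
def draft_and_verify(target, draft, prefix, k=4):
    """One round of greedy speculative decoding.

    `target` and `draft` are stand-ins for language models: callables
    mapping a token context (list) to the next token. The draft model
    proposes k tokens; the target model checks them and tokens are
    accepted up to the first disagreement, where the target's own
    token is substituted. If all k are accepted, the target emits one
    bonus token, so each round yields between 1 and k+1 tokens.
    """
    # Draft phase: the cheap model speculates k tokens autoregressively.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # Verify phase: in a real system the target scores all k positions
    # in one parallel forward pass; here we just replay them.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        if target(ctx) == t:          # target agrees with the draft
            accepted.append(t)
            ctx.append(t)
        else:                         # first mismatch: take target's token, stop
            accepted.append(target(ctx))
            break
    else:
        accepted.append(target(ctx))  # bonus token when all k were accepted
    return accepted
```

When the draft agrees with the target, each round emits k+1 tokens for roughly one target-model pass, which is the source of the speedup; a mismatching draft degrades gracefully to one target token per round.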

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 477 papers shown
Auditing Prompt Caching in Language Model APIs
Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
11 Feb 2025
LANTERN++: Enhancing Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models
Sihwan Park, Doohyuk Jang, Sungyub Kim, Souvik Kundu, Eunho Yang
10 Feb 2025

Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention
Zhendong Zhang
09 Feb 2025

Towards Sustainable NLP: Insights from Benchmarking Inference Energy in Large Language Models
S. Poddar, Paramita Koley, Janardan Misra, Niloy Ganguly, Saptarshi Ghosh
08 Feb 2025

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
Sukmin Cho, S. Choi, T. Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon
08 Feb 2025

Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference
Toby Simonds
05 Feb 2025

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, K. K., Amir Gholami
MQ
05 Feb 2025
M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference
Nikhil Bhendawade, Mahyar Najibi, Devang Naik, Irina Belousova
MoE
04 Feb 2025

Position: AI Scaling: From Up to Down and Out
Yunke Wang, Yanxi Li, Chang Xu
HAI
02 Feb 2025

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Gaurav Jain, Roy Schwartz, Moshe Wasserblat, David Harel
31 Jan 2025

Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models
A. Benazir, Felix Xiaozhu Lin
29 Jan 2025

Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo
28 Jan 2025

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models
Makoto Shing, Kou Misaki, Han Bao, Sho Yokoi, Takuya Akiba
VLM
28 Jan 2025

Toyteller: AI-powered Visual Storytelling Through Toy-Playing with Character Symbols
John Joon Young Chung, Melissa Roemmele, Max Kreminski
VGen
23 Jan 2025
AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, ..., Zhihao Zhang, Zhuoming Chen, Sean Lai, Xupeng Miao, Zhihao Jia
21 Jan 2025

Multi-Agent Collaboration Mechanisms: A Survey of LLMs
Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O'Sullivan, Hoang D. Nguyen
LLMAG
10 Jan 2025

Towards Sustainable Large Language Model Serving
Sophia Nguyen, Beihao Zhou, Yi Ding, Sihang Liu
31 Dec 2024

A novel framework for MCDM based on Z numbers and soft likelihood function
Yuanpeng He
26 Dec 2024
SlimGPT: Layer-wise Structured Pruning for Large Language Models
Gui Ling, Ziyang Wang, Yuliang Yan, Qingwen Liu
24 Dec 2024

Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels
Mingcong Song, Xinru Tang, Fengfan Hou, Jing Li, Wei Wei, ..., Hongjie Si, D. Jiang, Shouyi Yin, Yang Hu, Guoping Long
24 Dec 2024

SYMPHONY: Improving Memory Management for LLM Inference Workloads
Saurabh Agarwal, Anyong Mao, Aditya Akella, Shivaram Venkataraman
LLMAG
21 Dec 2024

Parallelized Autoregressive Visual Generation
Y. Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu
VGen
19 Dec 2024
Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models
Seungeun Oh, Jinhyuk Kim, Jihong Park, Seung-Woo Ko, Tony Q. S. Quek, Seong-Lyun Kim
17 Dec 2024

RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
X. Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou
RALM, LRM
16 Dec 2024

NITRO: LLM Inference on Intel Laptop NPUs
Anthony Fei, Mohamed S. Abdelfattah
15 Dec 2024

Constrained Decoding with Speculative Lookaheads
Nishanth Nakshatri, Shamik Roy, Rajarshi Das, Suthee Chaidaroon, Leonid Boytsov, Rashmi Gangadharaiah
09 Dec 2024

CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models
Amitash Nanda, Sree Bhargavi Balija, D. Sahoo
MQ
03 Dec 2024

PLD+: Accelerating LLM inference by leveraging Language Model Artifacts
Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena
02 Dec 2024
Neutralizing Backdoors through Information Conflicts for Large Language Models
Chen Chen, Yuchen Sun, Xueluan Gong, Jiaxin Gao, K. Lam
KELM, AAML
27 Nov 2024

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
Zhuofan Wen, Shangtong Gui, Yang Feng
25 Nov 2024

Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding
Hyun Ryu, Eric Kim
20 Nov 2024

Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, ..., Le Sun, Jie Lou, Bowen Yu, Y. Lu, Hongyu Lin
ALM
18 Nov 2024

Debiasing Watermarks for Large Language Models via Maximal Coupling
Yangxinyu Xie, Xiang Li, Tanwi Mallick, Weijie J. Su, Ruixun Zhang
WaLM
17 Nov 2024

SAM Decoding: Speculative Decoding via Suffix Automaton
Yuxuan Hu, Ke Wang, Jing Zhang, Fanjin Zhang, C. Li, H. Chen, Jing Zhang
16 Nov 2024
16 Nov 2024
SSSD: Simply-Scalable Speculative Decoding
SSSD: Simply-Scalable Speculative Decoding
Michele Marzollo
Jiawei Zhuang
Niklas Roemer
Lorenz K. Müller
Lukas Cavigelli
LRM
31
1
0
08 Nov 2024
SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding
SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding
Ryan Sun
Tianyi Zhou
Xun Chen
Lichao Sun
32
4
0
08 Nov 2024
SuffixDecoding: A Model-Free Approach to Speeding Up Large Language
  Model Inference
SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference
Gabriele Oliaro
Zhihao Jia
Daniel F Campos
Aurick Qiao
LRM
34
2
0
07 Nov 2024
The N-Grammys: Accelerating Autoregressive Inference with Learning-Free
  Batched Speculation
The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation
Lawrence Stewart
Matthew Trager
Sujan Kumar Gonugondla
Stefano Soatto
45
2
0
06 Nov 2024
Privacy Risks of Speculative Decoding in Large Language Models
Privacy Risks of Speculative Decoding in Large Language Models
Jiankun Wei
Abdulrahman Abdulrazzag
Tianchen Zhang
Adel Muursepp
Gururaj Saileshwar
33
2
0
01 Nov 2024
Interpretable Language Modeling via Induction-head Ngram Models
Interpretable Language Modeling via Induction-head Ngram Models
Eunji Kim
Sriya Mantena
Weiwei Yang
Chandan Singh
Sungroh Yoon
Jianfeng Gao
44
0
0
31 Oct 2024
Accelerated AI Inference via Dynamic Execution Methods
Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu
30 Oct 2024

A Theoretical Perspective for Speculative Decoding Algorithm
Ming Yin, Minshuo Chen, Kaixuan Huang, Mengdi Wang
30 Oct 2024

ProMoE: Fast MoE-based LLM Serving using Proactive Caching
Xiaoniu Song, Zihang Zhong, Rong Chen, Haibo Chen
MoE
29 Oct 2024

Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
Bohan Li, Hankun Wang, Situo Zhang, Yiwei Guo, Kai Yu
29 Oct 2024

Transferable Post-training via Inverse Value Learning
Xinyu Lu, Xueru Wen, Y. Lu, Bowen Yu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei Han, Yongbin Li
28 Oct 2024

Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments
Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo
28 Oct 2024
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
VLM
28 Oct 2024

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster
KELM
28 Oct 2024

FIRP: Faster LLM inference via future intermediate representation prediction
Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao
AI4CE
27 Oct 2024

Fast Best-of-N Decoding via Speculative Rejection
Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter L. Bartlett, Andrea Zanette
BDL
26 Oct 2024