2211.17192
Cited By
Fast Inference from Transformers via Speculative Decoding
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
LRM
Papers citing
"Fast Inference from Transformers via Speculative Decoding"
50 / 477 papers shown
Title
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
Hang Wu
Jianian Zhu
Y. Li
Haojie Wang
Biao Hou
Jidong Zhai
25
0
0
12 May 2025
Scaling Laws for Speculative Decoding
Siyuan Yan
Mo Zhu
Guo-qing Jiang
Jianfei Wang
Jiaxing Chen
...
Xiang Liao
Xiao Cui
Chen Zhang
Zhuoran Song
Ran Zhu
LRM
36
0
0
08 May 2025
LLAMAPIE: Proactive In-Ear Conversation Assistants
Tuochao Chen
Nicholas Batchelder
Alisa Liu
Noah A. Smith
Shyamnath Gollakota
60
0
0
07 May 2025
Diffusion Models are Secretly Exchangeable: Parallelizing DDPMs via Autospeculation
Hengyuan Hu
Aniket Das
Dorsa Sadigh
Nima Anari
DiffM
19
0
0
06 May 2025
AKD : Adversarial Knowledge Distillation For Large Language Models Alignment on Coding tasks
Ilyas Oulkadda
Julien Perez
ALM
32
0
0
05 May 2025
Semantic Probabilistic Control of Language Models
Kareem Ahmed
Catarina G Belém
Padhraic Smyth
Sameer Singh
35
0
0
04 May 2025
Accelerating Large Language Model Reasoning via Speculative Search
Zhihai Wang
Jie Wang
Jilai Pan
Xilin Xia
Huiling Zhen
M. Yuan
Jianye Hao
Feng Wu
ReLM
LRM
57
0
0
03 May 2025
PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
Bradley McDanel
S. Zhang
Y. Hu
Zining Liu
MoE
47
0
0
02 May 2025
Phantora: Live GPU Cluster Simulation for Machine Learning System Performance Estimation
Jianxing Qin
Jingrong Chen
Xinhao Kong
Yongji Wu
Liang Luo
Z. Wang
Ying Zhang
Tingjun Chen
Alvin R. Lebeck
Danyang Zhuo
48
0
0
02 May 2025
Scaling On-Device GPU Inference for Large Generative Models
Jiuqiang Tang
Raman Sarokin
Ekaterina Ignasheva
Grant Jensen
Lin Chen
Juhyun Lee
Andrei Kulik
Matthias Grundmann
52
0
0
01 May 2025
Bi-directional Model Cascading with Proxy Confidence
David Warren
Mark Dras
44
0
0
27 Apr 2025
Efficient Reasoning for LLMs through Speculative Chain-of-Thought
Jikai Wang
J. Li
Lijun Wu
M. Zhang
LLMAG
LRM
64
1
0
27 Apr 2025
GenTorrent: Scaling Large Language Model Serving with An Overley Network
Fei Fang
Yifan Hua
Shengze Wang
Ruilin Zhou
Y. Liu
Chen Qian
X. Zhang
46
0
0
27 Apr 2025
Platonic Grounding for Efficient Multimodal Language Models
Moulik Choraria
Xinbo Wu
Akhil Bhimaraju
Nitesh Sekhar
Yue Wu
Xu Zhang
Prateek Singhal
L. Varshney
54
0
0
27 Apr 2025
Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks
Yang Janet Liu
Bingjie Yan
Tianyuan Zou
Jianqing Zhang
Zixuan Gu
...
J. Li
Xiaozhou Ye
Ye Ouyang
Qiang Yang
Y. Zhang
ALM
77
1
0
24 Apr 2025
Energy Considerations of Large Language Model Inference and Efficiency Optimizations
Jared Fernandez
Clara Na
Vashisth Tiwari
Yonatan Bisk
Sasha Luccioni
Emma Strubell
29
0
0
24 Apr 2025
PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
Zihao An
Huajun Bai
Z. Liu
Dong Li
E. Barsoum
54
0
0
23 Apr 2025
SplitReason: Learning To Offload Reasoning
Yash Akhauri
Anthony Fei
Chi-chih Chang
Ahmed F. AbouElhamayed
Yueying Li
Mohamed S. Abdelfattah
OffRL
ReLM
LRM
41
0
0
23 Apr 2025
Context-Enhanced Contrastive Search for Improved LLM Text Generation
Jaydip Sen
Rohit Pandey
Hetvi Waghela
46
0
0
22 Apr 2025
Speculative Sampling via Exponential Races
Szymon Kobus
Deniz Gündüz
LRM
30
0
0
21 Apr 2025
Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
Xiaotian Zhang
Ruizhe Chen
Yang Feng
Zuozhu Liu
40
0
0
17 Apr 2025
Sleep-time Compute: Beyond Inference Scaling at Test-time
Kevin Lin
Charlie Snell
Y. Wang
Charles Packer
Sarah Wooders
Ion Stoica
Joseph E. Gonzalez
34
1
0
17 Apr 2025
Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance
S. Liu
Zhenzhe Zheng
Xiaoyao Huang
Fan Wu
Guihai Chen
Jie Wu
27
0
0
15 Apr 2025
EMAFusion: A Self-Optimizing System for Seamless LLM Selection and Integration
Soham Shah
Kumar Shridhar
Surojit Chatterjee
Souvik Sen
34
0
0
14 Apr 2025
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar
Shashank Nag
Jason Clemons
L. John
Poulami Das
26
0
0
14 Apr 2025
Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time
Wang Yang
Xiang Yue
V. Chaudhary
Xiaotian Han
ReLM
LRM
65
1
0
12 Apr 2025
Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices
Shengyuan Ye
Bei Ouyang
Liekang Zeng
Tianyi Qian
Xiaowen Chu
Jian Tang
Xu Chen
29
0
0
11 Apr 2025
SD²: Self-Distilled Sparse Drafters
Mike Lasby
Nish Sinnadurai
Valavan Manohararajah
Sean Lie
Vithursan Thangarasa
77
0
0
10 Apr 2025
Resource-efficient Inference with Foundation Model Programs
Lunyiu Nie
Zhimin Ding
Kevin Yu
Marco Cheung
C. Jermaine
S. Chaudhuri
26
0
0
09 Apr 2025
DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Hossein Entezari Zarch
Lei Gao
Chaoyi Jiang
Murali Annavaram
LRM
31
0
0
08 Apr 2025
SPIRe: Boosting LLM Inference Throughput with Speculative Decoding
Sanjit Neelam
Daniel Heinlein
Vaclav Cvicek
Akshay Mishra
Reiner Pope
LRM
36
0
0
08 Apr 2025
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Gleb Rodionov
Roman Garipov
Alina Shutova
George Yakushev
Vage Egiazarian
Anton Sinitsin
Denis Kuznedelev
Dan Alistarh
LRM
27
1
0
08 Apr 2025
PipeDec: Low-Latency Pipeline-based Inference with Dynamic Speculative Decoding towards Large-scale Models
Haofei Yin
Mengbai Xiao
Rouzhou Lu
Xiao Zhang
Dongxiao Yu
Guanghui Zhang
AI4CE
19
0
0
05 Apr 2025
SLOs-Serve: Optimized Serving of Multi-SLO LLMs
Siyuan Chen
Zhipeng Jia
S. Khan
Arvind Krishnamurthy
Phillip B. Gibbons
24
2
0
05 Apr 2025
Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding
Sakhinana Sagar Srinivas
Venkataramana Runkana
OffRL
45
1
0
02 Apr 2025
Efficient Construction of Model Family through Progressive Training Using Model Expansion
Kazuki Yano
Sho Takase
Sosuke Kobayashi
Shun Kiyono
Jun Suzuki
48
0
0
01 Apr 2025
Collaborative LLM Numerical Reasoning with Local Data Protection
Min Zhang
Yuzhe Lu
Yun Zhou
Panpan Xu
Lin Lee Cheong
Chang-Tien Lu
Haozhu Wang
50
0
0
01 Apr 2025
Adaptive Layer-skipping in Pre-trained LLMs
Xuan Luo
Weizhi Wang
Xifeng Yan
83
0
0
31 Mar 2025
Model Hemorrhage and the Robustness Limits of Large Language Models
Ziyang Ma
Z. Li
L. Zhang
Gui-Song Xia
Bo Du
Liangpei Zhang
Dacheng Tao
54
0
0
31 Mar 2025
Speculative End-Turn Detector for Efficient Speech Chatbot Assistant
Hyunjong Ok
Suho Yoo
Jaeho Lee
34
0
0
30 Mar 2025
FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning
Hang Guo
Yawei Li
Taolin Zhang
J. Wang
Tao Dai
Shu-Tao Xia
Luca Benini
67
1
0
30 Mar 2025
Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
Aayush Gautam
Susav Shrestha
Narasimha Annapareddy
43
0
0
28 Mar 2025
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Qingqing Zhao
Yao Lu
Moo Jin Kim
Zipeng Fu
Zhuoyang Zhang
...
Ankur Handa
Ming-Yu Liu
Donglai Xiang
Gordon Wetzstein
Tsung-Yi Lin
LM&Ro
LRM
43
10
0
27 Mar 2025
Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
Yijiong Yu
LRM
AIMat
90
1
0
26 Mar 2025
PCM : Picard Consistency Model for Fast Parallel Sampling of Diffusion Models
Junhyuk So
Jiwoong Shin
Chaeyeon Jang
Eunhyeok Park
DiffM
46
0
0
25 Mar 2025
BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache
Dayou Du
Shijie Cao
Jianyi Cheng
Ting Cao
M. Yang
MQ
76
0
0
24 Mar 2025
A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models
Zuan Xie
Yang Xu
Hongli Xu
Yunming Liao
Zhiwei Yao
49
0
0
23 Mar 2025
A Multi-Model Adaptation of Speculative Decoding for Classification
Somnath Roy
Padharthi Sreekar
Srivatsa Narasimha
Anubhav Anand
34
0
0
23 Mar 2025
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs
Shibo Jie
Yehui Tang
Kai Han
Zhi-Hong Deng
Jing Han
95
0
0
20 Mar 2025
Prada: Black-Box LLM Adaptation with Private Data on Resource-Constrained Devices
Z. Wang
Yexiao He
Zheyu Shen
Yu Li
Guoheng Sun
Myungjin Lee
Ang Li
48
0
0
19 Mar 2025