2211.17192
Fast Inference from Transformers via Speculative Decoding
International Conference on Machine Learning (ICML), 2023
30 November 2022
Yaniv Leviathan
Matan Kalman
Yossi Matias
LRM
Papers citing "Fast Inference from Transformers via Speculative Decoding"
50 / 763 papers shown
Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding
Zijian Lin
Yang Zhang
Yougen Yuan
Yuming Yan
Jinjiang Liu
Zhiyong Wu
Pengfei Hu
Qun Yu
316
1
0
21 May 2025
SSR: Speculative Parallel Scaling Reasoning in Test-time
Yuanlin Chu
Bo Wang
Xiang Liu
Hong Chen
Aiwei Liu
Xuming Hu
ReLM
LRM
343
2
0
21 May 2025
BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms
Yunlong Hou
Fengzhuo Zhang
Cunxiao Du
Xuan Zhang
Jiachun Pan
Tianyu Pang
Chao Du
Vincent Y. F. Tan
Zhuoran Yang
OffRL
449
7
0
21 May 2025
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
Siyue Zhang
Yilun Zhao
Liyuan Geng
Arman Cohan
Anh Tuan Luu
Chen Zhao
233
7
0
21 May 2025
The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute
Yunho Jin
Gu-Yeon Wei
David Brooks
LRM
422
7
0
20 May 2025
Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Ruixiao Li
Fahao Chen
Peng Li
356
0
0
20 May 2025
Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification
Jikai Wang
Zhenxu Tian
Jilong Li
Qingrong Xia
Xinyu Duan
Zhefeng Wang
Baoxing Huai
Min Zhang
294
3
0
19 May 2025
Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jie Ou
Jinyu Guo
Shuaihong Jiang
Zhaokun Wang
Libo Qin
Shunyu Yao
Wenhong Tian
3DV
519
4
0
19 May 2025
Policy Contrastive Decoding for Robotic Foundation Models
Shihan Wu
Ji Zhang
Xu Luo
Junlin Xie
Jingkuan Song
Heng Tao Shen
Lianli Gao
OffRL
849
2
0
19 May 2025
FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks
Zihua Wang
Ruibo Li
Haozhe Du
Joey Tianyi Zhou
Yu Zhang
Xu Yang
MLLM
421
1
0
19 May 2025
HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing
Leyang Xue
Yao Fu
Luo Mai
Mahesh K. Marina
339
1
0
18 May 2025
Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission
Seungeun Oh
Jinhyuk Kim
Jihong Park
Seung-Woo Ko
Jinho Choi
Tony Q. S. Quek
Seong-Lyun Kim
275
1
0
17 May 2025
SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs
Jinwoo Park
Seunggeun Cho
Dongsu Han
334
3
0
16 May 2025
MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models
Mugilan Ganesan
Siyang Song
Ankur Aggarwal
Nish Sinnadurai
Sean Lie
Vithursan Thangarasa
VLM
439
0
0
15 May 2025
Automatic Task Detection and Heterogeneous LLM Speculative Decoding
Danying Ge
Jianhua Gao
Qizhi Jiang
Yifei Feng
Weixing Ji
231
0
0
13 May 2025
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
Hang Wu
Jianian Zhu
Yongqian Li
Haojie Wang
Biao Hou
Jidong Zhai
425
1
0
12 May 2025
Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement
Xuechen Zhang
Zijian Huang
Chenshun Ni
Ziyang Xiong
Jiasi Chen
Samet Oymak
ReLM
LRM
592
7
0
12 May 2025
Overflow Prevention Enhances Long-Context Recurrent LLMs
Assaf Ben-Kish
Itamar Zimerman
M. Jehanzeb Mirza
James R. Glass
Leonid Karlinsky
Raja Giryes
LRM
401
3
0
12 May 2025
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Xiaomi LLM-Core Team
Bingquan Xia
Bo Shen
Cici
Dawei Zhu
...
Yun Wang
Yue Yu
Zhenru Lin
Zhichao Song
Zihao Yue
MoE
ReLM
LRM
AI4CE
566
48
0
12 May 2025
Scaling Laws for Speculative Decoding
Siyuan Yan
Mo Zhu
Guo-qing Jiang
Jianfei Wang
Jiaxing Chen
...
Xiang Liao
Xiao Cui
Chen Zhang
Zhuoran Song
Ran Zhu
LRM
345
1
0
08 May 2025
Scalable LLM Math Reasoning Acceleration with Low-rank Distillation
Harry Dong
Bilge Acun
Beidi Chen
Yuejie Chi
LRM
304
4
0
08 May 2025
LLAMAPIE: Proactive In-Ear Conversation Assistants
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Tuochao Chen
Nicholas Batchelder
Alisa Liu
Noah A. Smith
Shyamnath Gollakota
921
1
0
07 May 2025
Diffusion Models are Secretly Exchangeable: Parallelizing DDPMs via Autospeculation
Hengyuan Hu
Aniket Das
Dorsa Sadigh
Nima Anari
DiffM
347
5
0
06 May 2025
AKD : Adversarial Knowledge Distillation For Large Language Models Alignment on Coding tasks
Ilyas Oulkadda
Julien Perez
ALM
216
0
0
05 May 2025
Semantic Probabilistic Control of Language Models
Kareem Ahmed
Catarina G Belém
Padhraic Smyth
Sameer Singh
306
4
0
04 May 2025
Accelerating Large Language Model Reasoning via Speculative Search
Zhihai Wang
Jie Wang
Jilai Pan
Xilin Xia
Huiling Zhen
Mingxuan Yuan
Jianye Hao
Feng Wu
ReLM
LRM
305
12
0
03 May 2025
PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Bradley McDanel
Shanghang Zhang
Y. Hu
Zining Liu
MoE
934
2
0
02 May 2025
Phantora: Maximizing Code Reuse in Simulation-based Machine Learning System Performance Estimation
Jianxing Qin
Jingrong Chen
Xinhao Kong
Yongji Wu
Liang Luo
...
Ying Zhang
Tingjun Chen
Alvin R. Lebeck
Danyang Zhuo
638
0
0
02 May 2025
Scaling On-Device GPU Inference for Large Generative Models
Jiuqiang Tang
Raman Sarokin
Ekaterina Ignasheva
Grant Jensen
Lin Chen
Juhyun Lee
Andrei Kulik
Matthias Grundmann
632
10
0
01 May 2025
Efficient Reasoning for LLMs through Speculative Chain-of-Thought
Jikai Wang
Junlin Li
Jianye Hou
Hao Fei
Lijun Wu
Min Zhang
LLMAG
LRM
348
13
0
27 Apr 2025
PlanetServe: A Decentralized, Scalable, and Privacy-Preserving Overlay for Democratizing Large Language Model Serving
Fei Fang
Yifan Hua
Shengze Wang
Ruilin Zhou
Y. Liu
Chen Qian
Wei Wei
486
3
0
27 Apr 2025
DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding
Moulik Choraria
Xinbo Wu
Akhil Bhimaraju
Nitesh Sekhar
Yue Wu
Xu Zhang
Prateek Singhal
Lav Varshney
362
0
0
27 Apr 2025
Bi-directional Model Cascading with Proxy Confidence
David Warren
Mark Dras
285
1
0
27 Apr 2025
Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks
Yang Liu
Bingjie Yan
Tianyuan Zou
Jianqing Zhang
Zixuan Gu
...
Jiajian Li
Xiaozhou Ye
Ye Ouyang
Qiang Yang
Yanzhe Zhang
ALM
1.0K
4
0
24 Apr 2025
Energy Considerations of Large Language Model Inference and Efficiency Optimizations
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jared Fernandez
Clara Na
Vashisth Tiwari
Yonatan Bisk
Sasha Luccioni
Emma Strubell
492
19
0
24 Apr 2025
PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation
Zihao An
Huajun Bai
Ziqiang Liu
Dong Li
E. Barsoum
482
1
0
23 Apr 2025
SplitReason: Learning To Offload Reasoning
Yash Akhauri
Anthony Fei
Chi-chih Chang
Ahmed F. AbouElhamayed
Yueying Li
Mohamed S. Abdelfattah
OffRL
ReLM
LRM
266
4
0
23 Apr 2025
Context-Enhanced Contrastive Search for Improved LLM Text Generation
Jaydip Sen
Rohit Pandey
Hetvi Waghela
322
4
0
22 Apr 2025
Speculative Sampling via Exponential Races
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Szymon Kobus
Deniz Gündüz
LRM
189
0
0
21 Apr 2025
Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models
Yule Liu
Jingyi Zheng
Zhen Sun
Zifan Peng
Wenhan Dong
Zeyang Sha
Shiwen Cui
Weiqiang Wang
Xinlei He
OffRL
LRM
331
20
0
18 Apr 2025
Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xiaotian Zhang
Ruizhe Chen
Yang Feng
Zuozhu Liu
376
4
0
17 Apr 2025
Sleep-time Compute: Beyond Inference Scaling at Test-time
Kevin Lin
Charlie Snell
Longji Xu
Charles Packer
Sarah Wooders
Eric Liang
Alfons Kemper
320
17
0
17 Apr 2025
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
Tianyi Zhang
Yang Sui
Shaochen Zhong
Vipin Chaudhary
Helen Zhou
Xia Hu
Anshumali Shrivastava
MQ
286
10
0
15 Apr 2025
Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance
Shixuan Liu
Zhenzhe Zheng
Xiaoyao Huang
Fan Wu
Guihai Chen
Jie Wu
329
1
0
15 Apr 2025
EMAFusion: A Self-Optimizing System for Seamless LLM Selection and Integration
Soham Shah
Kumar Shridhar
Surojit Chatterjee
Souvik Sen
281
0
0
14 Apr 2025
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar
Shashank Nag
Jason Clemons
L. John
Poulami Das
466
1
0
14 Apr 2025
Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time
Wang Yang
Xiang Yue
Vipin Chaudhary
Xiaotian Han
ReLM
LRM
318
34
0
12 Apr 2025
Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices
IEEE Conference on Computer Communications (IEEE INFOCOM), 2025
Shengyuan Ye
Bei Ouyang
Liekang Zeng
Tianyi Qian
Xiaowen Chu
Jian Tang
Xu Chen
370
12
0
11 Apr 2025
SD$^2$: Self-Distilled Sparse Drafters
Mike Lasby
Nish Sinnadurai
Valavan Manohararajah
Sean Lie
Yani Andrew Ioannou
Vithursan Thangarasa
790
1
0
10 Apr 2025
Resource-efficient Inference with Foundation Model Programs
Lunyiu Nie
Zhimin Ding
Kevin Yu
Marco Cheung
C. Jermaine
S. Chaudhuri
281
1
0
09 Apr 2025