Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2023

30 November 2022

Yaniv Leviathan

Matan Kalman

Yossi Matias

LRM

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown

Towards On-Device Personalization: Cloud-device Collaborative Data Augmentation for Efficient On-device Language Model

136

29 Aug 2025

Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

160

28 Aug 2025

History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL

149

26 Aug 2025

Speculative Safety-Aware Decoding

Xuekang Wang

Shengyu Zhu

Xueqi Cheng

174

25 Aug 2025

SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

241

22 Aug 2025

Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates

...

120

22 Aug 2025

GPT-OSS-20B: A Comprehensive Deployment-Centric Analysis of OpenAI's Open-Weight Mixture of Experts Model

193

22 Aug 2025

Confidence-Modulated Speculative Decoding for Large Language Models

295

21 Aug 2025

WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling

...

113

21 Aug 2025

Measuring the environmental impact of delivering AI at Google Scale

...

Parthasarathy Ranganathan

112

21 Aug 2025

Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner

246

20 Aug 2025

A Comparative Study of Decoding Strategies in Medical Text Generation

125

19 Aug 2025

Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding

Jihoon Park

Seungeun Oh

Seong-Lyun Kim

18 Aug 2025

Cost-Aware Contrastive Routing for LLMs

313

17 Aug 2025

Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks

15 Aug 2025

Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech Translation

15 Aug 2025

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

...

202

14 Aug 2025

READER: Retrieval-Assisted Drafter for Efficient LLM Inference

163

12 Aug 2025

ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

127

12 Aug 2025

Grouped Speculative Decoding for Autoregressive Image Generation

100

11 Aug 2025

OverFill: Two-Stage Models for Efficient Language Model Decoding

Woojeong Kim

Junxiong Wang

Jing Nathan Yan

Mohamed S. Abdelfattah

Alexander M Rush

108

11 Aug 2025

Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions

...

256

11 Aug 2025

Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

188

11 Aug 2025

CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference

174

06 Aug 2025

An Efficient and Adaptive Next Edit Suggestion Framework with Zero Human Instructions in IDEs

121

04 Aug 2025

SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference

209

03 Aug 2025

Fast and scalable retrosynthetic planning with a transformer neural network and speculative beam search

02 Aug 2025

Optimal Scheduling Algorithms for LLM Inference: Theory and Practice

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 2025

Agrim Bari

Parikshit Hegde

G. Veciana

165

01 Aug 2025

XSpecMesh: Quality-Preserving Auto-Regressive Mesh Generation Acceleration via Multi-Head Speculative Decoding

221

31 Jul 2025

Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance

187

30 Jul 2025

Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

Jaydip Sen

Harshitha Puvvala

Subhasis Dasgupta

165

30 Jul 2025

Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting

102

29 Jul 2025

SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding

Design Automation Conference (DAC), 2025

247

24 Jul 2025

GATEBLEED: Exploiting On-Core Accelerator Power Gating for High Performance & Stealthy Attacks on AI

Samira Mirbagher Ajorpaz

282

22 Jul 2025

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

141

21 Jul 2025

ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

203

14 Jul 2025

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

...

296

14 Jul 2025

TPP-SD: Accelerating Transformer Point Process Sampling with Speculative Decoding

177

12 Jul 2025

BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

260

11 Jul 2025

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

342

03 Jul 2025

Cautious Next Token Prediction

225

03 Jul 2025

VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

...

116

28 Jun 2025

Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

221

23 Jun 2025

PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries

140

23 Jun 2025

Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?

169

20 Jun 2025

PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

Shufan Li

Aditya Grover

245

18 Jun 2025

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Bryan Kian Hsiang Low

Paul Liang

LLMAG OffRL LRM

391

18 Jun 2025

S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models

198

17 Jun 2025

Multimodal Large Language Models-Enabled UAV Swarm: Towards Efficient and Intelligent Autonomous Aerial Systems

...

173

15 Jun 2025

SPECS: Faster Test-Time Scaling through Speculative Drafts

213

15 Jun 2025