Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2022

30 November 2022

Yaniv Leviathan

Matan Kalman

Yossi Matias

LRM

Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown

TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

310

26 Feb 2025

Towards Optimal Multi-draft Speculative Decoding

International Conference on Learning Representations (ICLR), 2025

290

26 Feb 2025

CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

652

24 Feb 2025

FastCoder: Accelerating Repository-level Code Generation via Efficient Retrieval and Verification

329

24 Feb 2025

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

Jonathan Ragan-Kelley

Suvinay Subramanian

Michael Carbin

356

24 Feb 2025

LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

315

24 Feb 2025

Dynamic Parallel Tree Search for Efficient LLM Reasoning

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

488

22 Feb 2025

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025

385

21 Feb 2025

TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Bryan Kian Hsiang Low

348

21 Feb 2025

DReSD: Dense Retrieval for Speculative Decoding

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

523

21 Feb 2025

Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders

...

440

21 Feb 2025

Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models

261

21 Feb 2025

Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs

International Conference on Learning Representations (ICLR), 2025

334

21 Feb 2025

C2T: A Classifier-Based Tree Construction Method in Speculative Decoding

191

20 Feb 2025

SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

539

18 Feb 2025

Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral

339

18 Feb 2025

Language Models Can Predict Their Own Behavior

Dhananjay Ashok

Jonathan May

AI4TS ReLM LRM

426

18 Feb 2025

SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

487

17 Feb 2025

Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption

Alireza Nik

Pål Halvorsen

297

17 Feb 2025

SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer

294

16 Feb 2025

Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization

277

14 Feb 2025

Theoretical Benefit and Limitation of Diffusion Language Model

375

13 Feb 2025

Auditing Prompt Caching in Language Model APIs

355

11 Feb 2025

Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding

427

11 Feb 2025

LANTERN++: Enhancing Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models

351

10 Feb 2025

Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention

Zhendong Zhang

150

09 Feb 2025

Towards Sustainable NLP: Insights from Benchmarking Inference Energy in Large Language Models

North American Chapter of the Association for Computational Linguistics (NAACL), 2025

370

08 Feb 2025

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

North American Chapter of the Association for Computational Linguistics (NAACL), 2025

521

08 Feb 2025

Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference

Toby Simonds

278

05 Feb 2025

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache

251

05 Feb 2025

Intelligent Sensing-to-Action for Robust Autonomy at the Edge: Opportunities and Challenges

Design, Automation and Test in Europe (DATE), 2025

...

387

04 Feb 2025

M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference

453

04 Feb 2025

Position: AI Scaling: From Up to Down and Out

519

02 Feb 2025

Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment

International Conference on Learning Representations (ICLR), 2025

434

31 Jan 2025

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

658

31 Jan 2025

Safeguarding Privacy in Edge Speech Understanding with Tiny Foundation Models

A. Benazir

Felix Xiaozhu Lin

329

29 Jan 2025

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

International Conference on Learning Representations (ICLR), 2025

582

28 Jan 2025

Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs

449

28 Jan 2025

Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025

23 Jan 2025

Toyteller: AI-powered Visual Storytelling Through Toy-Playing with Character Symbols

International Conference on Human Factors in Computing Systems (CHI), 2025

John Joon Young Chung

Melissa Roemmele

Max Kreminski

VGen

302

23 Jan 2025

AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding

...

308

21 Jan 2025

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

310

252

10 Jan 2025

DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition

277

03 Jan 2025

Towards Sustainable Large Language Model Serving

ACM SIGEnergy Energy Informatics Review (SEIR), 2024

490

31 Dec 2024

A novel framework for MCDM based on Z numbers and soft likelihood function

Yuanpeng He

220

26 Dec 2024

SlimGPT: Layer-wise Structured Pruning for Large Language Models

Neural Information Processing Systems (NeurIPS), 2024

221

24 Dec 2024

Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels

...

172

24 Dec 2024

SYMPHONY: Improving Memory Management for LLM Inference Workloads

Saurabh Agarwal

Anyong Mao

Aditya Akella

Shivaram Venkataraman

LLMAG

237

21 Dec 2024

Parallelized Autoregressive Visual Generation

Computer Vision and Pattern Recognition (CVPR), 2024

649

19 Dec 2024

Deploying Foundation Model Powered Agent Services: A Survey

...

475

18 Dec 2024