
Fast Inference from Transformers via Speculative Decoding

International Conference on Machine Learning (ICML), 2022

30 November 2022

Yaniv Leviathan

Matan Kalman

Yossi Matias

LRM


Papers citing "Fast Inference from Transformers via Speculative Decoding"

50 / 763 papers shown

Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models

337

17 Dec 2024

RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

408

16 Dec 2024

NITRO: LLM Inference on Intel Laptop NPUs

Anthony Fei

Mohamed S. Abdelfattah

127

15 Dec 2024

Constrained Decoding with Speculative Lookaheads

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

456

09 Dec 2024

CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models

Amitash Nanda

Sree Bhargavi Balija

D. Sahoo

269

03 Dec 2024

PLD+: Accelerating LLM inference by leveraging Language Model Artifacts

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Shwetha Somasundaram

Anirudh Phukan

Apoorv Saxena

370

02 Dec 2024

Neutralizing Backdoors through Information Conflicts for Large Language Models

382

27 Nov 2024

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration

Neural Information Processing Systems (NeurIPS), 2024

Zhuofan Wen

Shangtong Gui

Yang Feng

407

25 Nov 2024

Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding

Hyun Ryu

Eric Kim

358

20 Nov 2024

Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering

...

590

18 Nov 2024

Debiasing Watermarks for Large Language Models via Maximal Coupling

Journal of the American Statistical Association (JASA), 2024

355

17 Nov 2024

SAM Decoding: Speculative Decoding via Suffix Automaton

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

478

16 Nov 2024

SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

213

08 Nov 2024

SSSD: Simply-Scalable Speculative Decoding

341

08 Nov 2024

SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

350

07 Nov 2024

The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

Lawrence Stewart

Matthew Trager

Sujan Kumar Gonugondla

Stefano Soatto

223

06 Nov 2024

When Speculation Spills Secrets: Side Channels via Speculative Decoding In LLMs

Jiankun Wei

Abdulrahman Abdulrazzag

Tianchen Zhang

Adel Muursepp

Gururaj Saileshwar

425

01 Nov 2024

Interpretable Next-token Prediction via the Generalized Induction Head

371

31 Oct 2024

Accelerated AI Inference via Dynamic Execution Methods

249

30 Oct 2024

A Theoretical Perspective for Speculative Decoding Algorithm

Neural Information Processing Systems (NeurIPS), 2024

217

30 Oct 2024

The Impact of Inference Acceleration on Bias of LLMs

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

356

29 Oct 2024

Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

318

29 Oct 2024

ProMoE: Fast MoE-based LLM Serving using Proactive Caching

488

29 Oct 2024

Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments

296

28 Oct 2024

Transferable Post-training via Inverse Value Learning

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

239

28 Oct 2024

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

459

28 Oct 2024

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

International Conference on Learning Representations (ICLR), 2024

396

28 Oct 2024

FIRP: Faster LLM inference via future intermediate representation prediction

Natural Language Processing and Chinese Computing (NLPCC), 2024

Jingang Wang

107

27 Oct 2024

Fast Best-of-N Decoding via Speculative Rejection

Neural Information Processing Systems (NeurIPS), 2024

Ruiqi Zhang

378

101

26 Oct 2024

Dynamic layer selection in decoder-only transformers

288

26 Oct 2024

Watermarking Large Language Models and the Generated Content: Opportunities and Challenges

Asilomar Conference on Signals, Systems and Computers (ACSSC), 2024

Ruisi Zhang

F. Koushanfar

WaLM

291

24 Oct 2024

AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability

Sudhanshu Agrawal

Wonseok Jeon

Mingu Lee

144

24 Oct 2024

Multi-Draft Speculative Sampling: Canonical Decomposition and Theoretical Limits

International Conference on Learning Representations (ICLR), 2024

Ashish Khisti

MohammadReza Ebrahimi

334

23 Oct 2024

Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition

Artem Basharin

Andrei Chertkov

Ivan Oseledets

407

23 Oct 2024

Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models

International Conference on Machine Learning (ICML), 2024

Qitan Lv

Jie Wang

Hanzhu Chen

Bin Li

Yongdong Zhang

Feng Wu

HILM

344

19 Oct 2024

MoDification: Mixture of Depths Made Easy

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

...

Min Zhang

204

18 Oct 2024

TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling

...

489

18 Oct 2024

Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

272

17 Oct 2024

Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement

139

17 Oct 2024

Learning to Route LLMs with Confidence Tokens

Yu-Neng Chuang

Helen Zhou

Prathusha Kameswara Sarma

287

17 Oct 2024

DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure

199

15 Oct 2024

Learning from Imperfect Data: Towards Efficient Knowledge Distillation of Autoregressive Language Models for Text-to-SQL

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

163

15 Oct 2024

QSpec: Speculative Decoding with Complementary Quantization Schemes

440

15 Oct 2024

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

598

15 Oct 2024

Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling

Xiangyu Yue

227

14 Oct 2024

Probabilistic Degeneracy Detection for Point-to-Plane Error Minimization

IEEE Robotics and Automation Letters (RA-L), 2024

Johan Hatleskog

Kostas Alexis

3DPC

406

14 Oct 2024

Self-Data Distillation for Recovering Quality in Pruned Large Language Models

493

13 Oct 2024

COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement

895

12 Oct 2024

QEFT: Quantization for Efficient Fine-Tuning of LLMs

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Changhun Lee

Jun-gyu Jin

Eunhyeok Park

214

11 Oct 2024

KV Prediction for Improved Time to First Token

240

10 Oct 2024