Speculative Decoding with Big Little Decoder

Neural Information Processing Systems (NeurIPS), 2023

15 February 2023

Sehoon Kim

Suhong Moon

Papers citing "Speculative Decoding with Big Little Decoder"

50 / 103 papers shown

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache Rishabh Tiwari Haocheng Xi Aditya Tomar Coleman Hooper Sehoon Kim Maxwell Horton Mahyar Najibi Michael W. Mahoney Kemal Kurniawan Amir Gholami MQ 193 9 0 05 Feb 2025
Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment International Conference on Learning Representations (ICLR), 2025 Gregor Bachmann Sotiris Anagnostidis Albert Pumarola Markos Georgopoulos A. Sanakoyeu Yuming Du Edgar Schönfeld Ali K. Thabet Jonas Kohler ALM BDL 350 28 0 31 Jan 2025
Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree AAAI Conference on Artificial Intelligence (AAAI), 2024 Xiangxiang Gao Weisheng Xie Yiwei Xiang Feng Ji 450 14 0 17 Dec 2024
Constrained Decoding with Speculative Lookaheads North American Chapter of the Association for Computational Linguistics (NAACL), 2024 Nishanth Nakshatri Shamik Roy Rajarshi Das Suthee Chaidaroon Leonid Boytsov Rashmi Gangadharaiah 410 3 0 09 Dec 2024
Software Performance Engineering for Foundation Model-Powered Software (FMware) Haoxiang Zhang Shi Chang Arthur Leung Kishanthan Thangarajah Boyuan Chen Hanan Lutfiyya Ahmed E. Hassan 552 2 0 14 Nov 2024
When Speculation Spills Secrets: Side Channels via Speculative Decoding In LLMs Jiankun Wei Abdulrahman Abdulrazzag Tianchen Zhang Adel Muursepp Gururaj Saileshwar 345 4 0 01 Nov 2024
A Theoretical Perspective for Speculative Decoding Algorithm Neural Information Processing Systems (NeurIPS), 2024 Ming Yin Minshuo Chen Kaixuan Huang Mengdi Wang 180 20 0 30 Oct 2024
Watermarking Large Language Models and the Generated Content: Opportunities and Challenges Asilomar Conference on Signals, Systems and Computers (ACSSC), 2024 Ruisi Zhang F. Koushanfar WaLM 229 3 0 24 Oct 2024
AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability Sudhanshu Agrawal Wonseok Jeon Mingu Lee 129 10 0 24 Oct 2024
big.LITTLE Vision Transformer for Efficient Visual Recognition He Guo Yulong Wang Zixuan Ye Jifeng Dai Yuwen Xiong ViT 199 1 0 14 Oct 2024
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration International Conference on Learning Representations (ICLR), 2024 Heming Xia Yongqi Li Jun Zhang Cunxiao Du Wenjie Li LRM 305 36 0 09 Oct 2024
A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models IEEE Circuits and Systems Magazine (IEEE CSM), 2024 Cong Guo Feng Cheng Zhixu Du James Kiessling Jonathan Ku ... Qilin Zheng Guanglei Zhou Hai Li-Wei Li Yiran Chen 169 17 0 08 Oct 2024
ESPACE: Dimensionality Reduction of Activations for Model Compression Neural Information Processing Systems (NeurIPS), 2024 Charbel Sakr Brucek Khailany 190 13 0 07 Oct 2024
Efficient Inference for Large Language Model-based Generative Recommendation International Conference on Learning Representations (ICLR), 2024 Xinyu Lin Chaoqun Yang Wenjie Wang Yongqi Li Cunxiao Du Fuli Feng See-Kiong Ng Tat-Seng Chua 294 13 0 07 Oct 2024
Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference Zongyue Qin Zifan He Neha Prakriya Jason Cong Yizhou Sun 260 7 0 25 Sep 2024
Multi-Programming Language Ensemble for Code Generation in Large Language Model Tengfei Xue Xuefeng Li Tahir Azim Roman Smirnov Jianhui Yu Arash Sadrieh Babak Pahlavan 196 3 0 06 Sep 2024
Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Jerry Huang Prasanna Parthasarathi Mehdi Rezagholizadeh Sarath Chandar 197 5 0 16 Aug 2024
Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding Bin Xiao Lujun Gui Lei Su Weipeng Chen 172 5 0 01 Aug 2024
Adaptive Draft-Verification for Efficient Large Language Model Decoding Xukun Liu Bowen Lei Ruqi Zhang Dongkuan Xu 214 7 0 27 Jun 2024
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees Yuhui Li Fangyun Wei Chao Zhang Hongyang R. Zhang 346 171 0 24 Jun 2024
LiveMind: Low-latency Large Language Models with Simultaneous Inference Chuangtao Chen Grace Li Zhang Xunzhao Yin Cheng Zhuo Ulf Schlichtmann Bing Li LRM 267 7 0 20 Jun 2024
Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving Ke Cheng Wen Hu Zhi Wang Hongen Peng Jianguo Li Sheng Zhang 151 14 0 19 Jun 2024
Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding Kaiyan Zhang Jianyu Wang Ning Ding Biqing Qi Ermo Hua Xingtai Lv Bowen Zhou 287 14 0 18 Jun 2024
Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction Ke Cheng Wen Hu Zhi Wang Peng Du Jianguo Li Sheng Zhang 241 16 0 07 Jun 2024
Fast yet Safe: Early-Exiting with Risk Control Metod Jazbec Alexander Timans Tin Hadži Veljković K. Sakmann Dan Zhang C. A. Naesseth Eric T. Nalisnick 238 12 0 31 May 2024
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths Kaixuan Huang Xudong Guo M. Y. Wang 433 38 0 30 May 2024
Faster Cascades via Speculative Decoding Harikrishna Narasimhan Wittawat Jitkrittum A. S. Rawat Seungyeon Kim Neha Gupta A. Menon Sanjiv Kumar LRM 316 19 0 29 May 2024
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass Ethan Shen Alan Fan Sarah M Pratt Jae Sung Park Matthew Wallingford Sham Kakade Ari Holtzman Ranjay Krishna Ali Farhadi Aditya Kusupati 302 4 0 28 May 2024
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference Hao Mark Chen Wayne Luk Ka-Fai Cedric Yiu Rui Li Konstantin Mishchenko Stylianos I. Venieris Hongxiang Fan 213 14 0 28 May 2024
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference International Conference on Learning Representations (ICLR), 2024 Nadav Timor Jonathan Mamou Daniel Korat Moshe Berchansky Oren Pereg Moshe Wasserblat Tomer Galanti Michal Gordon David Harel LRM 200 6 0 23 May 2024
A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models Mahsa Khoshnoodi Vinija Jain Mingye Gao Malavika Srikanth Vasu Sharma OffRL 288 7 0 15 May 2024
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge Bin Xiao Chunan Shi Xiaonan Nie Fan Yang Xiangwei Deng Lei Su Weipeng Chen Tengjiao Wang 224 10 0 01 May 2024
Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing Dujian Ding Ankur Mallick Chi Wang Robert Sim Subhabrata Mukherjee Victor Rühle L. Lakshmanan Ahmed Hassan Awadallah 331 174 0 22 Apr 2024
Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity Tyler Griggs Xiaoxuan Liu Jiaxiang Yu Doyoung Kim Wei-Lin Chiang Alvin Cheung Ion Stoica 249 23 0 22 Apr 2024
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding Hanshi Sun Zhuoming Chen Xinyu Yang Yuandong Tian Beidi Chen 301 83 0 18 Apr 2024
Exploring and Improving Drafts in Blockwise Parallel Decoding Taehyeon Kim A. Suresh Kishore Papineni Michael Riley Sanjiv Kumar Adrian Benton AI4TS 241 4 0 14 Apr 2024
Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding North American Chapter of the Association for Computational Linguistics (NAACL), 2024 Jie Ou Yueming Chen Wenhong Tian 232 21 0 10 Apr 2024
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation Michael Hassid Tal Remez Jonas Gehring Roy Schwartz Yossi Adi 245 40 0 31 Mar 2024
LLM Inference Unveiled: Survey and Roofline Model Insights Zhihang Yuan Yuzhang Shang Yang Zhou Zhen Dong Zhe Zhou ... Yong Jae Lee Yan Yan Beidi Chen Guangyu Sun Kurt Keutzer 531 143 0 26 Feb 2024
Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens Huiping Zhuang Jiahong Yu Qianshi Pang Zihao Wang Huiping Zhuang Cen Chen Xiaofeng Zou 214 5 0 24 Feb 2024
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding Hanling Yi Feng-Huei Lin Hongbin Li Peiyang Ning Xiaotian Yu Rong Xiao LRM 267 21 0 19 Feb 2024
Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs Yeonhong Park Jake Hyun SangLyul Cho Bonggeun Sim Jae W. Lee MQ 273 38 0 16 Feb 2024
Tandem Transformers for Inference Efficient LLMs S. AishwaryaP Pranav Ajit Nair Yashas Samaga Toby Boyd Sanjiv Kumar Prateek Jain Praneeth Netrapalli 152 10 0 13 Feb 2024
GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding Cunxiao Du Jing Jiang Yuanchen Xu Jiawei Wu Sicheng Yu ... Shenggui Li Kai Xu Liqiang Nie Zhaopeng Tu Yang You 191 57 0 03 Feb 2024
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding Yichao Fu Peter Bailis Ion Stoica Hao Zhang 325 231 0 03 Feb 2024
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty International Conference on Machine Learning (ICML), 2024 Yuhui Li Fangyun Wei Chao Zhang Hongyang R. Zhang 462 295 0 26 Jan 2024
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Heming Xia Zhe Yang Qingxiu Dong Peiyi Wang Chak Tou Leong Tao Ge Tianyu Liu Wenjie Li Zhifang Sui LRM 369 196 0 15 Jan 2024
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism International Conference on Machine Learning (ICML), 2023 Yanxi Chen Xuchen Pan Yaliang Li Bolin Ding Jingren Zhou LRM 390 53 0 08 Dec 2023
Efficient Deep Speech Understanding at the Edge Rongxiang Wang Felix Lin 146 1 0 22 Nov 2023
Speculative Contrastive Decoding Annual Meeting of the Association for Computational Linguistics (ACL), 2023 Hongyi Yuan Keming Lu Fei Huang Zheng Yuan Chang Zhou 139 8 0 15 Nov 2023