Speculative Streaming: Fast LLM Inference without Auxiliary Models
16 February 2024
Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
arXiv: 2402.11131
Papers citing "Speculative Streaming: Fast LLM Inference without Auxiliary Models" (6 papers)
| Title | Authors | Published |
| --- | --- | --- |
| Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition | Artem Basharin, Andrei Chertkov, Ivan V. Oseledets | 23 Oct 2024 |
| Extracting Paragraphs from LLM Token Activations | Nicholas Pochinkov, Angelo Benoit, Lovkush Agarwal, Zainab Ali Majid, Lucile Ter-Minassian | 10 Sep 2024 |
| LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi | 19 Jul 2024 |
| S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs | Wei Zhong, Manasa Bharadwaj | 30 May 2024 |
| Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang | 03 Feb 2024 |
| Scaling Laws for Neural Language Models | Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei | 23 Jan 2020 |