Hydragen: High-Throughput LLM Inference with Shared Prefixes
arXiv:2402.05099 (v2, latest)

7 February 2024
Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini
Links: arXiv (abs) · PDF · HTML · HuggingFace (20 upvotes) · GitHub (49★)

Papers citing "Hydragen: High-Throughput LLM Inference with Shared Prefixes"

35 papers
On the Role of Temperature Sampling in Test-Time Scaling
Yuheng Wu, Azalia Mirhoseini, Thierry Tambe
02 Oct 2025

TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix
Ahmet Caner Yüzügüler, Ahmet Çelik, Jiawei Zhuang, Lukas Cavigelli
25 Sep 2025

Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations
Maurizio Diaz
23 Aug 2025

Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
Bangsheng Tang, Carl Chengyan Fu, Fei Kou, Grigory Sizov, Haoci Zhang, ..., Vlad Mihailescu, Xingwen Guo, Yan Cui, Y. Hu, Yejin Lee
11 Aug 2025

Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 2025
Agrim Bari, Parikshit Hegde, G. Veciana
01 Aug 2025

CaliDrop: KV Cache Compression with Calibration
Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
26 Jul 2025

ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, Dingwen Tao
14 Jul 2025

Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
05 Jun 2025

SpecMemo: Speculative Decoding is in Your Pocket
Selin Yildirim, Deming Chen
16 May 2025

Accurate KV Cache Quantization with Outlier Tokens Tracing
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yi Su, Yuechi Zhou, Quantong Qiu, Jilong Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
16 May 2025

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
Yichao Yuan, Lin Ma, Nishil Talati
12 Apr 2025

Queueing, Predictions, and LLMs: Challenges and Open Problems
Michael Mitzenmacher, Rana Shahout
10 Mar 2025

Auditing Prompt Caching in Language Model APIs
Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
11 Feb 2025

KVDirect: Distributed Disaggregated LLM Inference
Shiyang Chen, Rain Jiang, Dezhi Yu, Jinlai Xu, Mengyuan Chao, Fanlong Meng, Chenyu Jiang, Wei Xu, Hang Liu
28 Jan 2025

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai
15 Jan 2025

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini
03 Jan 2025

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, ..., Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
02 Jan 2025

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Eric Liang
25 Nov 2024

Context Parallelism for Scalable Million-Token Inference
Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jeremy Reizenstein, Jongsoo Park, Jianyu Huang
04 Nov 2024

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Hanshi Sun, Li-Wen Chang, Yiyuan Ma, Wenlei Bao, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
28 Oct 2024

Accelerating Direct Preference Optimization with Prefix Sharing
Franklin Wang, Sumanth Hegde
27 Oct 2024

A Simple Model of Inference Scaling Laws
Noam Levi
21 Oct 2024

Geometric Collaborative Filtering with Convergence
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Hisham Husain, Julien Monteil
04 Oct 2024

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang
08 Sep 2024

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui
24 Jun 2024

TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, ..., Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang
20 Jun 2024

New Solutions on LLM Acceleration, Optimization, and Application
Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen
16 Jun 2024

Training of Physical Neural Networks
Ali Momeni, Babak Rahmani, B. Scellier, Logan G. Wright, Peter L. McMahon, ..., Julie Grollier, Andrea J. Liu, D. Psaltis, Andrea Alù, Romain Fleury
05 Jun 2024

Preble: Efficient Distributed Prompt Scheduling for LLM Serving
Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang
08 May 2024

DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference
Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin
30 Mar 2024

Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks
Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf
19 Mar 2024

Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, ..., Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang
13 Mar 2024

Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry
Keshav Santhanam, Deepti Raghavan, Muhammad Shahir Rahman, Thejas Venkatesh, Neha Kunjal, Maximilien Cura, Houjun Liu, Pratiksha Thaker, Philip Levis, Matei A. Zaharia
07 Mar 2024

SGLang: Efficient Execution of Structured Language Model Programs
Neural Information Processing Systems (NeurIPS), 2023
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, ..., Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, Ying Sheng
12 Dec 2023

Fast Transformer Decoding: One Write-Head is All You Need
Noam M. Shazeer
06 Nov 2019