Hydragen: High-Throughput LLM Inference with Shared Prefixes
arXiv:2402.05099 (v2, latest)

7 February 2024
Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini
Links: arXiv (abs) · PDF · HTML · HuggingFace (20 upvotes) · GitHub (49★)

Papers citing "Hydragen: High-Throughput LLM Inference with Shared Prefixes"

35 papers
On the Role of Temperature Sampling in Test-Time Scaling
Yuheng Wu, Azalia Mirhoseini, Thierry Tambe
02 Oct 2025

TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix
Ahmet Caner Yüzügüler, Ahmet Çelik, Jiawei Zhuang, Lukas Cavigelli
25 Sep 2025

Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations
Maurizio Diaz
23 Aug 2025

Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
Bangsheng Tang, Carl Chengyan Fu, Fei Kou, Grigory Sizov, Haoci Zhang, ..., Vlad Mihailescu, Xingwen Guo, Yan Cui, Y. Hu, Yejin Lee
11 Aug 2025

Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 2025
Agrim Bari, Parikshit Hegde, G. Veciana
01 Aug 2025

CaliDrop: KV Cache Compression with Calibration
Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
26 Jul 2025

ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, Dingwen Tao
14 Jul 2025

Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
05 Jun 2025

SpecMemo: Speculative Decoding is in Your Pocket
Selin Yildirim, Deming Chen
16 May 2025

Accurate KV Cache Quantization with Outlier Tokens Tracing
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yi Su, Yuechi Zhou, Quantong Qiu, Jilong Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
16 May 2025

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
Yichao Yuan, Lin Ma, Nishil Talati
12 Apr 2025

Queueing, Predictions, and LLMs: Challenges and Open Problems
Michael Mitzenmacher, Rana Shahout
10 Mar 2025

Auditing Prompt Caching in Language Model APIs
Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
11 Feb 2025

KVDirect: Distributed Disaggregated LLM Inference
Shiyang Chen, Rain Jiang, Dezhi Yu, Jinlai Xu, Mengyuan Chao, Fanlong Meng, Chenyu Jiang, Wei Xu, Hang Liu
28 Jan 2025

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai
15 Jan 2025

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini
03 Jan 2025

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, ..., Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
02 Jan 2025

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Eric Liang
25 Nov 2024

Context Parallelism for Scalable Million-Token Inference
Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jeremy Reizenstein, Jongsoo Park, Jianyu Huang
04 Nov 2024

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Hanshi Sun, Li-Wen Chang, Yiyuan Ma, Wenlei Bao, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
28 Oct 2024

Accelerating Direct Preference Optimization with Prefix Sharing
Franklin Wang, Sumanth Hegde
27 Oct 2024

A Simple Model of Inference Scaling Laws
Noam Levi
21 Oct 2024

Geometric Collaborative Filtering with Convergence
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Hisham Husain, Julien Monteil
04 Oct 2024

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang
08 Sep 2024

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui
24 Jun 2024

TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, ..., Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang
20 Jun 2024

New Solutions on LLM Acceleration, Optimization, and Application
Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen
16 Jun 2024

Training of Physical Neural Networks
Ali Momeni, Babak Rahmani, B. Scellier, Logan G. Wright, Peter L. McMahon, ..., Julie Grollier, Andrea J. Liu, D. Psaltis, Andrea Alù, Romain Fleury
05 Jun 2024

Preble: Efficient Distributed Prompt Scheduling for LLM Serving
Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang
08 May 2024

DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference
Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin
30 Mar 2024

Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks
Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf
19 Mar 2024

Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs
Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, ..., Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang
13 Mar 2024

Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry
Keshav Santhanam, Deepti Raghavan, Muhammad Shahir Rahman, Thejas Venkatesh, Neha Kunjal, Maximilien Cura, Houjun Liu, Pratiksha Thaker, Philip Levis, Matei A. Zaharia
07 Mar 2024

SGLang: Efficient Execution of Structured Language Model Programs
Neural Information Processing Systems (NeurIPS), 2023
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, ..., Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, Ying Sheng
12 Dec 2023

Fast Transformer Decoding: One Write-Head is All You Need
Noam M. Shazeer
06 Nov 2019