DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
arXiv:2403.01876 · 4 March 2024
F. Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
Papers citing "DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving" (16 papers):
1. Taming the Titans: A Survey of Efficient LLM Inference Serving (28 Apr 2025)
   Ranran Zhen, J. Li, Yixin Ji, Z. Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Z. Wang, Baoxing Huai, M. Zhang

2. Climate And Resource Awareness is Imperative to Achieving Sustainable AI (and Preventing a Global AI Arms Race) (27 Feb 2025)
   Pedram Bakhtiarifard, Pınar Tözün, Christian Igel, Raghavendra Selvan

3. KVDirect: Distributed Disaggregated LLM Inference (28 Jan 2025)
   Shiyang Chen, Rain Jiang, Dezhi Yu, Jinlai Xu, Mengyuan Chao, Fanlong Meng, Chenyu Jiang, Wei Xu, Hang Liu

4. SYMPHONY: Improving Memory Management for LLM Inference Workloads (21 Dec 2024)
   Saurabh Agarwal, Anyong Mao, Aditya Akella, Shivaram Venkataraman

5. On the Cost of Model-Serving Frameworks: An Experimental Evaluation (15 Nov 2024)
   Pasquale De Rosa, Yérom-David Bromberg, Pascal Felber, Djob Mvondo, V. Schiavoni

6. NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference (02 Nov 2024)
   Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu

7. BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching (24 Oct 2024)
   Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang

8. Efficient LLM Scheduling by Learning to Rank (28 Aug 2024)
   Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang

9. P/D-Serve: Serving Disaggregated Large Language Model at Scale (15 Aug 2024)
   Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, ..., Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang

10. MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool (25 Jun 2024)
    Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, ..., Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

11. Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving (11 May 2024)
    Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu

12. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention (07 May 2024)
    Ramya Prabhu, Ajay Nayak, Jayashree Mohan, R. Ramjee, Ashish Panwar

13. Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services (25 Apr 2024)
    Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, Mosharaf Chowdhury

14. Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference (14 Mar 2024)
    Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath

15. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (18 Jan 2024)
    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang

16. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (13 Mar 2023)
    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang