S³: Increasing GPU Utilization during Generative Inference for Higher Throughput
arXiv:2306.06000 · 9 June 2023
Yunho Jin, Chun-Feng Wu, David Brooks, Gu-Yeon Wei
Papers citing "S³: Increasing GPU Utilization during Generative Inference for Higher Throughput"
42 / 42 papers shown
Taming the Titans: A Survey of Efficient LLM Inference Serving
Ranran Zhen, J. Li, Yixin Ji, Z. Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Z. Wang, Baoxing Huai, M. Zhang
28 Apr 2025 · LLMAG

Tempo: Application-aware LLM Serving with Mixed SLO Requirements
Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, Fan Lai
24 Apr 2025

High-Throughput LLM inference on Heterogeneous Clusters
Yi Xiong, Jinqi Huang, Wenjie Huang, Xuebing Yu, Entong Li, Zhixiong Ning, Jinhua Zhou, Li Zeng, Xin Chen
18 Apr 2025

Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
Shihong Gao, X. Zhang, Yanyan Shen, Lei Chen
10 Apr 2025

SQuat: Subspace-orthogonal KV Cache Quantization
Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
31 Mar 2025 · MQ

Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation
Jingzhi Fang, Yanyan Shen, Y. Wang, Lei Chen
21 Mar 2025

Mitigating KV Cache Competition to Enhance User Experience in LLM Inference
Haiying Shen, Tanmoy Sen, Masahiro Tanaka
17 Mar 2025

AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
Haiying Shen, Tanmoy Sen
17 Mar 2025

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral
11 Mar 2025

Queueing, Predictions, and LLMs: Challenges and Open Problems
Michael Mitzenmacher, Rana Shahout
10 Mar 2025 · AI4TS, LRM

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai
15 Jan 2025

Multi-Bin Batching for Increasing LLM Inference Throughput
Ozgur Guldogan, Jackson Kunde, Kangwook Lee, Ramtin Pedarsani
03 Dec 2024 · LRM

Ensuring Fair LLM Serving Amid Diverse Applications
Redwan Ibne Seraj Khan, Kunal Jain, Haiying Shen, Ankur Mallick, Anjaly Parayil, ..., Yue Cheng, A. R. Butt, Victor Rühle, Chetan Bansal, Saravan Rajmohan
24 Nov 2024

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling
Youpeng Zhao, Jun Wang
31 Oct 2024

BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching
Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang
24 Oct 2024 · KELM

Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs
Ferdi Kossmann, Bruce Fontaine, Daya Khudia, Michael Cafarella, Samuel Madden
23 Oct 2024

AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
Qian Tao, Wenyuan Yu, Jingren Zhou
17 Oct 2024 · MQ

A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models
Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, ..., Qilin Zheng, Guanglei Zhou, Hai Li, Yiran Chen
08 Oct 2024

ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving
Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, Harry Xu
02 Oct 2024

Don't Stop Me Now: Embedding Based Scheduling for LLMs
Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, Michael Mitzenmacher
01 Oct 2024 · AI4TS

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
01 Oct 2024

UELLM: A Unified and Efficient Approach for LLM Inference Serving
Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu
23 Sep 2024

Efficient LLM Scheduling by Learning to Rank
Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang
28 Aug 2024

Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling
Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, ..., Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan
24 Aug 2024

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse
01 Aug 2024

LLM Inference Serving: Survey of Recent Advances and Opportunities
Baolin Li, Yankai Jiang, V. Gadepally, Devesh Tiwari
17 Jul 2024

Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
Ke Cheng, Wen Hu, Zhi Wang, Hongen Peng, Jianguo Li, Sheng Zhang
19 Jun 2024

Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction
Ke Cheng, Wen Hu, Zhi Wang, Peng Du, Jianguo Li, Sheng Zhang
07 Jun 2024

PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services
Zheming Yang, Yuanhao Yang, Chang Zhao, Qi Guo, Wenkai He, Wen Ji
23 May 2024

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica
22 Apr 2024

A Survey on Efficient Inference for Large Language Models
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, ..., Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu-Xiang Wang
22 Apr 2024

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Basar, Ravishankar K. Iyer
12 Apr 2024

Towards Pareto Optimal Throughput in Small Language Model Serving
Pol G. Recasens, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral
04 Apr 2024

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Lu Ye, Ze Tao, Yong Huang, Yang Li
23 Feb 2024

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods
Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song
05 Feb 2024

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
05 Feb 2024 · MQ

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia
23 Dec 2023

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
13 Mar 2023

Can Foundation Models Wrangle Your Data?
A. Narayan, Ines Chami, Laurel J. Orr, Simran Arora, Christopher Ré
20 May 2022 · LMTD, AI4CE

I-BERT: Integer-only BERT Quantization
Sehoon Kim, A. Gholami, Z. Yao, Michael W. Mahoney, Kurt Keutzer
05 Jan 2021 · MQ

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
31 Dec 2020 · AIMat

Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, ..., Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed
28 Jul 2020 · VLM