AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
arXiv:2302.11665 (v2, latest), 22 February 2023
Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
Papers citing "AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving" (50 of 64 papers shown)
From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models
Xingqi Cui, Chieh-Jan Mike Liang, Jiarong Xing, Haoran Qiu
04 Nov 2025

Cross-Embodiment Dexterous Hand Articulation Generation via Morphology-Aware Learning
Heng Zhang, Kevin Yuchen Ma, Mike Zheng Shou, Weisi Lin, Yan Wu
07 Oct 2025

ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
Xingyu Xiang, Raj Joshi, Yuhan Liu, Jiayi Yao, Chenxingyu Zhao, Junchen Jiang, Yang Zhou, Eddie Kohler, Minlan Yu
21 Sep 2025

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, ..., Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu
27 Aug 2025

A Survey on Cloud-Edge-Terminal Collaborative Intelligence in AIoT Networks
Jiaqi Wu, Jing Liu, Yang Liu, Lixu Wang, Z. Wang, Wei Chen, Zijian Tian, Richard Yu, Victor C.M. Leung
26 Aug 2025

LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems
Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu
28 Jul 2025
PolyServe: Efficient Multi-SLO Serving at Scale [MoE]
Kan Zhu, Haiyang Shi, Le Xu, Jiaxin Shan, Arvind Krishnamurthy, Baris Kasikci, Liguang Xie
17 Jul 2025

ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, Dingwen Tao
14 Jul 2025

HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing
Leyang Xue, Yao Fu, Luo Mai, Mahesh K. Marina
18 May 2025

MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems [MoE]
Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, ..., Dayou Du, Tairan Xu, Edoardo Ponti, Luo Mai
16 May 2025

ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor
Seungbeom Choi, Jeonghoe Goo, Eunjoo Jeon, Mingyu Yang, Minsung Jang
14 May 2025
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Y. Li, ..., Shiyi Cao, Ke Bao, Ion Stoica, Harry Xu, Ying Sheng
06 May 2025

Circinus: Efficient Query Planner for Compound ML Serving [LRM]
Banruo Liu, Wei-Yu Lin, Minghao Fang, Yihan Jiang, Fan Lai
23 Apr 2025

Efficient LLM Serving on Hybrid Real-time and Best-effort Requests [VLM]
Wan Borui, Zhao Juntao, Jiang Chenyu, Guo Chuanxiong, Wu Chuan
13 Apr 2025

AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
Haiying Shen, Tanmoy Sen
17 Mar 2025

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
IEEE International Conference on Cloud Computing (CLOUD), 2025
Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral
11 Mar 2025
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025
Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez Luna, Huawei Li, Xiaowei Li, Ying Wang, O. Mutlu
21 Feb 2025

iServe: An Intent-based Serving System for LLMs [VLM]
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar
08 Jan 2025

TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Haoran Qiu, Rodrigo Fonseca, Josep Torrellas, Ricardo Bianchini
05 Jan 2025

SMDP-Based Dynamic Batching for Improving Responsiveness and Energy Efficiency of Batch Services
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2025
Yaodan Xu, Sheng Zhou, Zhisheng Niu
04 Jan 2025

Deploying Foundation Model Powered Agent Services: A Survey [AI4CE]
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen
18 Dec 2024
Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads [FedML]
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Wei Zhao, Anand Jayarajan, Gennady Pekhimenko
09 Oct 2024

Scaling Laws For Mixed Quantization [MQ]
Zeyu Cao, Boyang Gu, Cheng Zhang, Pedro Gimenes, Jianqiao Lu, Jianyi Cheng, Xitong Gao, Yiren Zhao
09 Oct 2024

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
01 Oct 2024

P/D-Serve: Serving Disaggregated Large Language Model at Scale [MoE]
Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, ..., Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang
15 Aug 2024

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
International Symposium on High-Performance Computer Architecture (HPCA), 2024
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse
01 Aug 2024

CascadeServe: Unlocking Model Cascades for Inference Serving
Ferdi Kossmann, Ziniu Wu, Alex Turk, Nesime Tatbul, Lei Cao, Samuel Madden
20 Jun 2024
Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey
Feng Liang, Zhen Zhang, Haifeng Lu, Chengming Li, Victor C. M. Leung, Yanyi Guo, Xiping Hu
12 Jun 2024

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang
11 Jun 2024

Llumnix: Dynamic Scheduling for Large Language Model Serving
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Jialin Li
05 Jun 2024

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu
30 May 2024

EdgeSight: Enabling Modeless and Cost-Efficient Inference at the Edge [VLM]
ChonLam Lao, Jiaqi Gao, Ganesh Ananthanarayanan, Aditya Akella, Minlan Yu
29 May 2024

Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference
Shengyuan Ye, Jiangsu Du, Liekang Zeng, Wenzhong Ou, Xiaowen Chu, Yutong Lu, Xu Chen
27 May 2024

Preble: Efficient Distributed Prompt Scheduling for LLM Serving
Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang
08 May 2024

Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services
Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, Mosharaf Chowdhury
25 Apr 2024
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Basar, Ravishankar K. Iyer
12 Apr 2024

Towards Pareto Optimal Throughput in Small Language Model Serving
Pol G. Recasens, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral
04 Apr 2024

MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms [MoE]
Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue
03 Apr 2024

Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference
Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, Josep Torrellas
29 Mar 2024

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines
Jiaao He, Jidong Zhai
18 Mar 2024

Characterization of Large Language Model Development in the Datacenter
Symposium on Networked Systems Design and Implementation (NSDI), 2024
Qi Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Mengdie Zhang, ..., Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang
12 Mar 2024

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models [DiffM]
Zhekai Zhang, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming-Yu Liu, Kai Li, Song Han
29 Feb 2024
Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows
Yuting Yang, Andrea Merlina, Weijia Song, Tiancheng Yuan, Ken Birman, Roman Vitenberg
27 Feb 2024

ServeFlow: A Fast-Slow Model Architecture for Network Traffic Analysis
Shinan Liu, Ted Shaowang, Gerry Wan, Jeewon Chae, Jonatas Marques, Sanjay Krishnan, Nick Feamster
06 Feb 2024

BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems
Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Zhenheng Tang, ..., Rui Guo, Xin Wang, Qiang-qiang Wang, Amelie Chi Zhou, Xiaowen Chu
31 Jan 2024

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024
Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai
25 Jan 2024

MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache [MoE]
Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh K. Marina
25 Jan 2024

CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang
20 Jan 2024

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads [DRL]
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, ..., Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
20 Jan 2024

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang
18 Jan 2024