Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020
3 June 2020
A. Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace
arXiv:2006.02464

Papers citing "Serving DNNs like Clockwork: Performance Predictability from the Bottom Up"

Showing 50 of 107 citing papers.
ScaleDL: Towards Scalable and Efficient Runtime Prediction for Distributed Deep Learning Workloads
Xiaokai Wang, Shaoyuan Huang, Yuting Li, Xiaofei Wang
06 Nov 2025

FairBatching: Fairness-Aware Batch Formation for LLM Inference
Hongtao Lyu, Boyue Liu, Mingyu Wu, Haibo Chen
16 Oct 2025

FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee
08 Sep 2025

Orthogonal Soft Pruning for Efficient Class Unlearning
Qinghui Gong, Xue Yang, Xiaohu Tang
24 Jun 2025

HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing
Leyang Xue, Yao Fu, Luo Mai, Mahesh K. Marina
18 May 2025
ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor
Seungbeom Choi, Jeonghoe Goo, Eunjoo Jeon, Mingyu Yang, Minsung Jang
14 May 2025

Patchwork: A Unified Framework for RAG Serving
Bodun Hu, Luis Pabon, Saurabh Agarwal, Aditya Akella
01 May 2025

Circinus: Efficient Query Planner for Compound ML Serving
Banruo Liu, Wei-Yu Lin, Minghao Fang, Yihan Jiang, Fan Lai
23 Apr 2025

Guillotine: Hypervisors for Isolating Malicious AIs
USENIX Workshop on Hot Topics in Operating Systems (HotOS), 2025
James Mickens, Sarah Radway, Ravi Netravali
22 Apr 2025

StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, ..., Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang
22 Apr 2025
LithOS: An Operating System for Efficient Machine Learning on GPUs
Symposium on Operating Systems Principles (SOSP), 2025
Patrick H. Coppock, Brian Zhang, Eliot H. Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C. Mowry, Dimitrios Skarlatos
21 Apr 2025

HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar, Shashank Nag, Jason Clemons, L. John, Poulami Das
14 Apr 2025

Niyama: Breaking the Silos of LLM Inference Serving
Kanishk Goel, Jayashree Mohan, Nipun Kwatra, Ravi Anupindi, Ramachandran Ramjee
28 Mar 2025

TURBO: Utility-Aware Bandwidth Allocation for Cloud-Augmented Autonomous Control
Peter Schafhalter, Alexander Krentsel, Alfons Kemper, Sylvia Ratnasamy, S. Shenker, Ion Stoica
26 Mar 2025

iServe: An Intent-based Serving System for LLMs
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar
08 Jan 2025
Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen
18 Dec 2024

DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
Sohaib Ahmad, Qizheng Yang, Haoliang Wang, Ramesh K. Sitaraman, Hui Guan
22 Nov 2024

Software Performance Engineering for Foundation Model-Powered Software
Haoxiang Zhang, Shi Chang, Arthur Leung, Kishanthan Thangarajah, Boyuan Chen, Hanan Lutfiyya, Ahmed E. Hassan
14 Nov 2024

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling
International Conference on Computer Aided Design (ICCAD), 2024
Youpeng Zhao, Jun Wang
31 Oct 2024

Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving
Alexander Krentsel, Peter Schafhalter, Alfons Kemper, Sylvia Ratnasamy, S. Shenker, Ion Stoica
21 Oct 2024
ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving
Yifan Qiao, Shu Anzai, S. Yu, Haoran Ma, Yang Wang, ..., Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, Harry Xu
02 Oct 2024

A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro
European Conference on Computer Systems (EuroSys), 2024
Beomyeol Jeon, Chen Wang, Diana Arroyo, Alaa Youssef, Indranil Gupta
29 Sep 2024

ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2024
Munkyu Lee, Sihoon Seong, Minki Kang, Jihyuk Lee, Gap-Joo Na, In-Geol Chun, Dimitrios Nikolopoulos, Cheol-Ho Hong
22 Sep 2024

Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration
Tianyu Wang, Sheng Li, Bingyao Li, Yuezhen Dai, Ao Li, Geng Yuan, Yufei Ding, Youtao Zhang, Xulong Tang
18 Jul 2024
Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems
Amey Agrawal, Anmol Agarwal, Nitin Kedia, Jayashree Mohan, Souvik Kundu, Nipun Kwatra, Ramachandran Ramjee, Alexey Tumanov
09 Jul 2024

Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling
Sohaib Ahmad, Hui Guan, Ramesh K. Sitaraman
04 Jul 2024

SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules
Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Zhipeng Di, ..., Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang
02 Jul 2024

Teola: Towards End-to-End Optimization of LLM-based Applications
Xin Tan, Yimin Jiang, Yitao Yang, Hong-Yu Xu
29 Jun 2024

NetNN: Neural Intrusion Detection System in Programmable Networks
Kamran Razavi, Shayan Davari Fard, George Karlos, Vinod Nigade, Max Mühlhäuser, Lin Wang
28 Jun 2024
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim
28 Jun 2024

CascadeServe: Unlocking Model Cascades for Inference Serving
Ferdi Kossmann, Ziniu Wu, Alex Turk, Nesime Tatbul, Lei Cao, Samuel Madden
20 Jun 2024

Llumnix: Dynamic Scheduling for Large Language Model Serving
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Jialin Li
05 Jun 2024

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu
30 May 2024

EdgeSight: Enabling Modeless and Cost-Efficient Inference at the Edge
ChonLam Lao, Jiaqi Gao, Ganesh Ananthanarayanan, Aditya Akella, Minlan Yu
29 May 2024
Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving
Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu
11 May 2024

Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services
Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, Mosharaf Chowdhury
25 Apr 2024

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Basar, Ravishankar K. Iyer
12 Apr 2024

MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue
03 Apr 2024

MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
International Conference on Machine Learning (ICML), 2024
Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang
02 Apr 2024
Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
Kamran Razavi, Saeid Ghafouri, Max Mühlhäuser, Pooyan Jamshidi, Lin Wang
31 Mar 2024

Hydro: Adaptive Query Processing of ML Queries
Gaurav Tarlok Kakkar, Jiashen Cao, Aubhro Sengupta, Joy Arulraj, Hyesoon Kim
22 Mar 2024

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee
04 Mar 2024

FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees
Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, ..., April Yang, Yingcheng Wang, Colin Unger, Zhihao Jia
29 Feb 2024

Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows
Yuting Yang, Andrea Merlina, Weijia Song, Tiancheng Yuan, Ken Birman, Roman Vitenberg
27 Feb 2024
Training DNN Models over Heterogeneous Clusters with Optimal Performance
Chengyi Nie, Jessica Maghakian, Zhenhua Liu
07 Feb 2024

Towards Deterministic End-to-end Latency for Medical AI Systems in NVIDIA Holoscan
Soham Sinha, Shekhar Dwivedi, Mahdi Azizian
06 Feb 2024

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024
Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai
25 Jan 2024

MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache
Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh K. Marina
25 Jan 2024

CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang
20 Jan 2024

Towards providing reliable job completion time predictions using PCS
Abdullah Bin Faisal, Noah Martin, Hafiz Mohsin Bashir, Swaminathan Lamelas, Fahad R. Dogar
18 Jan 2024