Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020
3 June 2020
A. Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace
arXiv:2006.02464

Papers citing "Serving DNNs like Clockwork: Performance Predictability from the Bottom Up"

Showing 50 of 107 citing papers.
ScaleDL: Towards Scalable and Efficient Runtime Prediction for Distributed Deep Learning Workloads
Xiaokai Wang, Shaoyuan Huang, Yuting Li, Xiaofei Wang
06 Nov 2025

FairBatching: Fairness-Aware Batch Formation for LLM Inference
Hongtao Lyu, Boyue Liu, Mingyu Wu, Haibo Chen
16 Oct 2025

FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee
08 Sep 2025

Orthogonal Soft Pruning for Efficient Class Unlearning
Qinghui Gong, Xue Yang, Xiaohu Tang
24 Jun 2025

HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing
Leyang Xue, Yao Fu, Luo Mai, Mahesh K. Marina
18 May 2025
ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor
Seungbeom Choi, Jeonghoe Goo, Eunjoo Jeon, Mingyu Yang, Minsung Jang
14 May 2025

Patchwork: A Unified Framework for RAG Serving
Bodun Hu, Luis Pabon, Saurabh Agarwal, Aditya Akella
01 May 2025

Circinus: Efficient Query Planner for Compound ML Serving
Banruo Liu, Wei-Yu Lin, Minghao Fang, Yihan Jiang, Fan Lai
23 Apr 2025

Guillotine: Hypervisors for Isolating Malicious AIs
USENIX Workshop on Hot Topics in Operating Systems (HotOS), 2025
James Mickens, Sarah Radway, Ravi Netravali
22 Apr 2025

StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, ..., Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang
22 Apr 2025
LithOS: An Operating System for Efficient Machine Learning on GPUs
Symposium on Operating Systems Principles (SOSP), 2025
Patrick H. Coppock, Brian Zhang, Eliot H. Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C. Mowry, Dimitrios Skarlatos
21 Apr 2025

HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar, Shashank Nag, Jason Clemons, L. John, Poulami Das
14 Apr 2025

Niyama: Breaking the Silos of LLM Inference Serving
Kanishk Goel, Jayashree Mohan, Nipun Kwatra, Ravi Anupindi, Ramachandran Ramjee
28 Mar 2025

TURBO: Utility-Aware Bandwidth Allocation for Cloud-Augmented Autonomous Control
Peter Schafhalter, Alexander Krentsel, Alfons Kemper, Sylvia Ratnasamy, S. Shenker, Ion Stoica
26 Mar 2025

iServe: An Intent-based Serving System for LLMs
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar
08 Jan 2025
Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen
18 Dec 2024

DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
Sohaib Ahmad, Qizheng Yang, Haoliang Wang, Ramesh K. Sitaraman, Hui Guan
22 Nov 2024

Software Performance Engineering for Foundation Model-Powered Software
Haoxiang Zhang, Shi Chang, Arthur Leung, Kishanthan Thangarajah, Boyuan Chen, Hanan Lutfiyya, Ahmed E. Hassan
14 Nov 2024

ALISE: Accelerating Large Language Model Serving with Speculative Scheduling
International Conference on Computer Aided Design (ICCAD), 2024
Youpeng Zhao, Jun Wang
31 Oct 2024

Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving
Alexander Krentsel, Peter Schafhalter, Alfons Kemper, Sylvia Ratnasamy, S. Shenker, Ion Stoica
21 Oct 2024
ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving
Yifan Qiao, Shu Anzai, S. Yu, Haoran Ma, Yang Wang, ..., Yang Zhou, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, Harry Xu
02 Oct 2024

A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro
European Conference on Computer Systems (EuroSys), 2024
Beomyeol Jeon, Chen Wang, Diana Arroyo, Alaa Youssef, Indranil Gupta
29 Sep 2024

ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2024
Munkyu Lee, Sihoon Seong, Minki Kang, Jihyuk Lee, Gap-Joo Na, In-Geol Chun, Dimitrios Nikolopoulos, Cheol-Ho Hong
22 Sep 2024

Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration
Tianyu Wang, Sheng Li, Bingyao Li, Yuezhen Dai, Ao Li, Geng Yuan, Yufei Ding, Youtao Zhang, Xulong Tang
18 Jul 2024
Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems
Amey Agrawal, Anmol Agarwal, Nitin Kedia, Jayashree Mohan, Souvik Kundu, Nipun Kwatra, Ramachandran Ramjee, Alexey Tumanov
09 Jul 2024

Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling
Sohaib Ahmad, Hui Guan, Ramesh K. Sitaraman
04 Jul 2024

SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules
Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Zhipeng Di, ..., Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang
02 Jul 2024

Teola: Towards End-to-End Optimization of LLM-based Applications
Xin Tan, Yimin Jiang, Yitao Yang, Hong-Yu Xu
29 Jun 2024

NetNN: Neural Intrusion Detection System in Programmable Networks
Kamran Razavi, Shayan Davari Fard, George Karlos, Vinod Nigade, Max Mühlhäuser, Lin Wang
28 Jun 2024
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim
28 Jun 2024

CascadeServe: Unlocking Model Cascades for Inference Serving
Ferdi Kossmann, Ziniu Wu, Alex Turk, Nesime Tatbul, Lei Cao, Samuel Madden
20 Jun 2024

Llumnix: Dynamic Scheduling for Large Language Model Serving
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Jialin Li
05 Jun 2024

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu
30 May 2024

EdgeSight: Enabling Modeless and Cost-Efficient Inference at the Edge
ChonLam Lao, Jiaqi Gao, Ganesh Ananthanarayanan, Aditya Akella, Minlan Yu
29 May 2024
Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving
Chengyi Nie, Rodrigo Fonseca, Zhenhua Liu
11 May 2024

Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services
Jiachen Liu, Zhiyu Wu, Jae-Won Chung, Fan Lai, Myungjin Lee, Mosharaf Chowdhury
25 Apr 2024

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction
Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Basar, Ravishankar K. Iyer
12 Apr 2024

MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue
03 Apr 2024

MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving
International Conference on Machine Learning (ICML), 2024
Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang
02 Apr 2024
Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
Kamran Razavi, Saeid Ghafouri, Max Mühlhäuser, Pooyan Jamshidi, Lin Wang
31 Mar 2024

Hydro: Adaptive Query Processing of ML Queries
Gaurav Tarlok Kakkar, Jiashen Cao, Aubhro Sengupta, Joy Arulraj, Hyesoon Kim
22 Mar 2024

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee
04 Mar 2024

FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees
Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, ..., April Yang, Yingcheng Wang, Colin Unger, Zhihao Jia
29 Feb 2024

Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows
Yuting Yang, Andrea Merlina, Weijia Song, Tiancheng Yuan, Ken Birman, Roman Vitenberg
27 Feb 2024
Training DNN Models over Heterogeneous Clusters with Optimal Performance
Chengyi Nie, Jessica Maghakian, Zhenhua Liu
07 Feb 2024

Towards Deterministic End-to-end Latency for Medical AI Systems in NVIDIA Holoscan
Soham Sinha, Shekhar Dwivedi, Mahdi Azizian
06 Feb 2024

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024
Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai
25 Jan 2024

MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache
Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh K. Marina
25 Jan 2024

CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang
20 Jan 2024

Towards providing reliable job completion time predictions using PCS
Abdullah Bin Faisal, Noah Martin, Hafiz Mohsin Bashir, Swaminathan Lamelas, Fahad R. Dogar
18 Jan 2024