Serving DNNs like Clockwork: Performance Predictability from the Bottom Up

3 June 2020
A. Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace

Papers citing "Serving DNNs like Clockwork: Performance Predictability from the Bottom Up" (27 papers shown)

Patchwork: A Unified Framework for RAG Serving
Bodun Hu, Luis Pabon, Saurabh Agarwal, Aditya Akella
01 May 2025

Circinus: Efficient Query Planner for Compound ML Serving
Banruo Liu, Wei-Yu Lin, Minghao Fang, Yihan Jiang, Fan Lai
23 Apr 2025 · LRM

LithOS: An Operating System for Efficient Machine Learning on GPUs
Patrick H. Coppock, Brian Zhang, Eliot H. Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C. Mowry, Dimitrios Skarlatos
21 Apr 2025

Bandwidth Allocation for Cloud-Augmented Autonomous Driving
Peter Schafhalter, Alexander Krentsel, Joseph E. Gonzalez, Sylvia Ratnasamy, S. Shenker, Ion Stoica
26 Mar 2025

iServe: An Intent-based Serving System for LLMs
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar
08 Jan 2025 · VLM

Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling
Sohaib Ahmad, Hui Guan, Ramesh K. Sitaraman
04 Jul 2024

Teola: Towards End-to-End Optimization of LLM-based Applications
Xin Tan, Yimin Jiang, Yitao Yang, Hong-Yu Xu
29 Jun 2024

MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue
03 Apr 2024 · MoE

Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling
Kamran Razavi, Saeid Ghafouri, Max Mühlhäuser, Pooyan Jamshidi, Lin Wang
31 Mar 2024

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
Xupeng Miao, Gabriele Oliaro, Xinhao Cheng, Vineeth Kada, Ruohan Gao, ..., April Yang, Yingcheng Wang, Mengdi Wu, Colin Unger, Zhihao Jia
29 Feb 2024 · MoE

Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows
Yuting Yang, Andrea Merlina, Weijia Song, Tiancheng Yuan, Ken Birman, Roman Vitenberg
27 Feb 2024

Towards providing reliable job completion time predictions using PCS
Abdullah Bin Faisal, Noah Martin, Hafiz Mohsin Bashir, Swaminathan Lamelas, Fahad R. Dogar
18 Jan 2024

Graft: Efficient Inference Serving for Hybrid Deep Learning with SLO Guarantees via DNN Re-alignment
Jing Wu, Lin Wang, Qirui Jin, Fangming Liu
17 Dec 2023

Splitwise: Efficient generative LLM inference using phase splitting
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini
30 Nov 2023

Punica: Multi-Tenant LoRA Serving
Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy
28 Oct 2023

RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of Language Models
Yufei Li, Zexin Li, Wei Yang, Cong Liu
12 Sep 2023

Pareto-Secure Machine Learning (PSML): Fingerprinting and Securing Inference Serving Systems
Debopam Sanyal, Jui-Tse Hung, Manavi Agrawal, Prahlad Jasti, Shahab Nikkhoo, S. Jha, Tianhao Wang, Sibin Mohan, Alexey Tumanov
03 Jul 2023

S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput
Yunho Jin, Chun-Feng Wu, David Brooks, Gu-Yeon Wei
09 Jun 2023

Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service
Baolin Li, S. Samsi, V. Gadepally, Devesh Tiwari
19 Apr 2023

MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters
Yihao Zhao, Xin Liu, Shufan Liu, Xiang Li, Yibo Zhu, Gang Huang, Xuanzhe Liu, Xin Jin
24 Mar 2023

Kernel-as-a-Service: A Serverless Interface to GPUs
Nathan Pemberton, Anton Zabreyko, Zhoujie Ding, R. Katz, Joseph E. Gonzalez
15 Dec 2022

KAIROS: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
Baolin Li, S. Samsi, V. Gadepally, Devesh Tiwari
12 Oct 2022

SplitPlace: AI Augmented Splitting and Placement of Large-Scale Neural Networks in Mobile Edge Environments
Shreshth Tuli, G. Casale, N. Jennings
21 May 2022

Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem
Cheng Tan, Zhichao Li, Jian Zhang, Yunyin Cao, Sikai Qi, Zherui Liu, Yibo Zhu, Chuanxiong Guo
18 Sep 2021

Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider
Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, G. Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, M. Russinovich, Ricardo Bianchini
06 Mar 2020

ALERT: Accurate Learning for Energy and Timeliness
Chengcheng Wan, M. Santriaji, E. Rogers, H. Hoffmann, Michael Maire, Shan Lu
31 Oct 2019 · AI4CE

Aggregated Residual Transformations for Deep Neural Networks
Saining Xie, Ross B. Girshick, Piotr Dollár, Z. Tu, Kaiming He
16 Nov 2016