CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator

Main: 11 pages · Bibliography: 4 pages · Appendix: 3 pages · 14 figures · 4 tables

Abstract

Studies conducted on enterprise-scale infrastructure have shown that GPUs, the core computational resource for deep learning (DL) training, are often significantly underutilized. Collocating DL tasks on the same GPU is an opportunity to address this underutilization. However, collocation may result in (1) out-of-memory crashes for the subsequently arriving task and (2) slowdowns for all tasks sharing the GPU due to resource interference. The former poses a threat to robustness, while the latter affects quality of service and energy efficiency.
