Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers
Shijian Li, R. Walls, Tian Guo
arXiv:2004.03072 · 7 April 2020

Cited By
Papers citing "Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers" (15 of 15 papers shown)

OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters
S. Tyagi, Prateek Sharma (21 Mar 2025)

PowerTrain: Fast, Generalizable Time and Power Prediction Models to Optimize DNN Training on Accelerated Edges
Prashanthi S.K., Saisamarth Taluri, Beautlin S, Lakshya Karwa, Yogesh L. Simmhan (18 Jul 2024)

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms
Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John Douglas Owens (19 Apr 2024)

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, Zhihao Jia (21 Mar 2024)

On the Burstiness of Distributed Machine Learning Traffic
Natchanon Luangsomboon, Fahimeh Fazel, Jorg Liebeherr, A. Sobhani, Shichao Guan, Xingjun Chu (30 Dec 2023)

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, ..., Rui Guo, Xin Wang, Qiong Luo, S. Shi, Xiaowen Chu (07 Nov 2023)

Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching
S. Tyagi, Prateek Sharma (20 May 2023)

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks
Mingyu Liang, Wenyin Fu, Louis Feng, Zhongyi Lin, P. Panakanti, Shengbao Zheng, Srinivas Sridharan, Christina Delimitrou (16 Dec 2022)

FuncPipe: A Pipelined Serverless Framework for Fast and Cost-efficient Training of Deep Learning Models
Yunzhuo Liu, Bo Jiang, Tian Guo, Zimeng Huang, Wen-ping Ma, Xinbing Wang, Chenghu Zhou (28 Apr 2022)

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
Zhongyi Lin, Louis Feng, E. K. Ardestani, Jaewon Lee, J. Lundell, Changkyu Kim, A. Kejariwal, John Douglas Owens (19 Jan 2022)

On the Future of Cloud Engineering
David Bermbach, A. Chandra, C. Krintz, A. Gokhale, Aleksander Slominski, L. Thamsen, Everton Cavalcante, Tian Guo, Ivona Brandić, R. Wolski (19 Aug 2021)

Quantifying and Improving Performance of Distributed Deep Learning with Cloud Storage
Nicholas Krichevsky, M. S. Louis, Tian Guo (13 Aug 2021)

Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning
Shijian Li, Oren Mangoubi, Lijie Xu, Tian Guo (16 Apr 2021)

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training
Letian Zhao, Rui Xu, Tianqi Wang, Teng Tian, Xiaotian Wang, Wei Wu, Chio-in Ieong, Xi Jin (23 Dec 2020)

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy (21 May 2018)