Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers
Shijian Li, R. Walls, Tian Guo
arXiv:2004.03072 · 7 April 2020

Cited By
Papers citing "Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers" (15 of 15 papers shown)

OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters
S. Tyagi, Prateek Sharma (21 Mar 2025)

PowerTrain: Fast, Generalizable Time and Power Prediction Models to Optimize DNN Training on Accelerated Edges
Prashanthi S.K., Saisamarth Taluri, Beautlin S, Lakshya Karwa, Yogesh L. Simmhan (18 Jul 2024)

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms
Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John Douglas Owens (19 Apr 2024)

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, Zhihao Jia (21 Mar 2024)

On the Burstiness of Distributed Machine Learning Traffic
Natchanon Luangsomboon, Fahimeh Fazel, Jorg Liebeherr, A. Sobhani, Shichao Guan, Xingjun Chu (30 Dec 2023)

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, ..., Rui Guo, Xin Wang, Qiong Luo, S. Shi, Xiaowen Chu (07 Nov 2023)

Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching
S. Tyagi, Prateek Sharma (20 May 2023)

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks
Mingyu Liang, Wenyin Fu, Louis Feng, Zhongyi Lin, P. Panakanti, Shengbao Zheng, Srinivas Sridharan, Christina Delimitrou (16 Dec 2022)

FuncPipe: A Pipelined Serverless Framework for Fast and Cost-efficient Training of Deep Learning Models
Yunzhuo Liu, Bo Jiang, Tian Guo, Zimeng Huang, Wen-ping Ma, Xinbing Wang, Chenghu Zhou (28 Apr 2022)

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
Zhongyi Lin, Louis Feng, E. K. Ardestani, Jaewon Lee, J. Lundell, Changkyu Kim, A. Kejariwal, John Douglas Owens (19 Jan 2022)

On the Future of Cloud Engineering
David Bermbach, A. Chandra, C. Krintz, A. Gokhale, Aleksander Slominski, L. Thamsen, Everton Cavalcante, Tian Guo, Ivona Brandić, R. Wolski (19 Aug 2021)

Quantifying and Improving Performance of Distributed Deep Learning with Cloud Storage
Nicholas Krichevsky, M. S. Louis, Tian Guo (13 Aug 2021)

Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning
Shijian Li, Oren Mangoubi, Lijie Xu, Tian Guo (16 Apr 2021)

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training
Letian Zhao, Rui Xu, Tianqi Wang, Teng Tian, Xiaotian Wang, Wei Wu, Chio-in Ieong, Xi Jin (23 Dec 2020)

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy (21 May 2018)