Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers (arXiv:2004.03072)
7 April 2020
Shijian Li, R. Walls, Tian Guo

Papers citing "Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers"

15 papers shown

OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters
S. Tyagi, Prateek Sharma (21 Mar 2025)

PowerTrain: Fast, Generalizable Time and Power Prediction Models to Optimize DNN Training on Accelerated Edges
Prashanthi S.K., Saisamarth Taluri, Beautlin S, Lakshya Karwa, Yogesh L. Simmhan (18 Jul 2024)

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms
Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John Douglas Owens (19 Apr 2024)

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, Zhihao Jia (21 Mar 2024)

On the Burstiness of Distributed Machine Learning Traffic
Natchanon Luangsomboon, Fahimeh Fazel, Jorg Liebeherr, A. Sobhani, Shichao Guan, Xingjun Chu (30 Dec 2023)

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, ..., Rui Guo, Xin Wang, Qiong Luo, S. Shi, Xiaowen Chu (07 Nov 2023)

Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching
S. Tyagi, Prateek Sharma (20 May 2023)

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks
Mingyu Liang, Wenyin Fu, Louis Feng, Zhongyi Lin, P. Panakanti, Shengbao Zheng, Srinivas Sridharan, Christina Delimitrou (16 Dec 2022)

FuncPipe: A Pipelined Serverless Framework for Fast and Cost-efficient Training of Deep Learning Models
Yunzhuo Liu, Bo Jiang, Tian Guo, Zimeng Huang, Wen-ping Ma, Xinbing Wang, Chenghu Zhou (28 Apr 2022)

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
Zhongyi Lin, Louis Feng, E. K. Ardestani, Jaewon Lee, J. Lundell, Changkyu Kim, A. Kejariwal, John Douglas Owens (19 Jan 2022)

On the Future of Cloud Engineering
David Bermbach, A. Chandra, C. Krintz, A. Gokhale, Aleksander Slominski, L. Thamsen, Everton Cavalcante, Tian Guo, Ivona Brandić, R. Wolski (19 Aug 2021)

Quantifying and Improving Performance of Distributed Deep Learning with Cloud Storage
Nicholas Krichevsky, M. S. Louis, Tian Guo (13 Aug 2021)

Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning
Shijian Li, Oren Mangoubi, Lijie Xu, Tian Guo (16 Apr 2021)

BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training
Letian Zhao, Rui Xu, Tianqi Wang, Teng Tian, Xiaotian Wang, Wei Wu, Chio-in Ieong, Xi Jin (23 Dec 2020)

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy (21 May 2018)