Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms

30 June 2020

Srinivas Sridharan

Amoghavarsha Suresh

Papers citing "Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms"

Title | Authors | Tags | Counts | Date
Optimizing Large Model Training through Overlapped Activation Recomputation | Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, ..., Yi Zheng, Zhefeng Wang, Yanlong Yin, Gang Chen | - | 33, 5, 0 | 13 Jun 2024
PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices | Si Ung Noh, Junguk Hong, Chaemin Lim, Seong-Yeol Park, Jeehyun Kim, Hanjun Kim, Youngsok Kim, Jinho Lee | - | 34, 6, 0 | 13 Apr 2024
UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming | Hao Lin, Ke Wu, Jie Li, Jun Yu Li, Wu-Jun Li | - | 26, 1, 0 | 31 Jul 2023
Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training | W. Tan, Xiao Shi, Cunchi Lv, Xiaofang Zhao | FedML | 15, 1, 0 | 09 Mar 2023
Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism | Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, Bin Cui | GNN, MoE | 32, 60, 0 | 25 Nov 2022
HammingMesh: A Network Topology for Large-Scale Deep Learning | Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, Steve Scott | 3DH, GNN, AI4CE | 18, 20, 0 | 03 Sep 2022
Impact of RoCE Congestion Control Policies on Distributed Training of DNNs | Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, T. Krishna | OOD | 13, 11, 0 | 22 Jul 2022
Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models | Saeed Rashidi, William Won, S. Srinivasan, Srinivas Sridharan, T. Krishna | GNN | 17, 29, 0 | 09 Oct 2021
RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing | Liu Ke, Udit Gupta, Carole-Jean Wu, B. Cho, Mark Hempstead, ..., Dheevatsa Mudigere, Maxim Naumov, Martin D. Schatz, M. Smelyanskiy, Xiaodong Wang | - | 41, 212, 0 | 30 Dec 2019
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro | MoE | 245, 1,817, 0 | 17 Sep 2019
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation | Yonghui Wu, M. Schuster, Z. Chen, Quoc V. Le, Mohammad Norouzi, ..., Alex Rudnick, Oriol Vinyals, G. Corrado, Macduff Hughes, J. Dean | AIMat | 716, 6,743, 0 | 26 Sep 2016