On Scale-out Deep Learning Training for Cloud and HPC

24 January 2018
Srinivas Sridharan, K. Vaidyanathan, Dhiraj D. Kalamkar, Dipankar Das, Mikhail E. Smorkalov, Mikhail Shiryaev, Dheevatsa Mudigere, Naveen Mellempudi, Sasikanth Avancha, Bharat Kaul, Pradeep Dubey
Abstract

The exponential growth in the use of large deep neural networks has accelerated the need to train these networks in hours or even minutes. This can be achieved only through scalable and efficient distributed training, since a single node/card cannot satisfy the compute, memory, and I/O requirements of today's state-of-the-art deep neural networks. However, scaling synchronous Stochastic Gradient Descent (SGD) remains a challenging problem that requires continued research and development, with innovations spanning algorithms, frameworks, communication libraries, and system design. In this paper, we describe the philosophy, design, and implementation of the Intel Machine Learning Scalability Library (MLSL) and present proof points demonstrating the scaling of DL training to hundreds and thousands of nodes across Cloud and HPC systems.
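
To make the communication pattern behind this kind of scale-out concrete, the sketch below shows data-parallel synchronous SGD with an allreduce gradient exchange, written with mpi4py and NumPy. This is not the MLSL API; the model, the toy least-squares gradient, and names such as local_gradient are illustrative assumptions, but the step structure (compute local gradient, allreduce-average across ranks, apply an identical update on every replica) is the pattern that communication libraries like MLSL are built to optimize.

# Minimal sketch of synchronous, data-parallel SGD via MPI allreduce.
# Assumes mpi4py and NumPy; not the MLSL API.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)
w = np.zeros(64)          # model weights, replicated on every rank
lr = 0.01

def local_gradient(w, batch):
    # Toy least-squares gradient computed on this rank's local mini-batch.
    X, y = batch
    return X.T @ (X @ w - y) / len(y)

for step in range(100):
    # Each rank draws its own mini-batch (stand-in for a sharded dataset).
    X = rng.standard_normal((32, 64))
    y = X @ np.ones(64) + 0.1 * rng.standard_normal(32)

    g = local_gradient(w, (X, y))

    # Synchronous step: sum gradients across all ranks, then average,
    # so every replica applies exactly the same update.
    g_avg = np.empty_like(g)
    comm.Allreduce(g, g_avg, op=MPI.SUM)
    g_avg /= world

    w -= lr * g_avg

if rank == 0:
    print("final weight norm:", np.linalg.norm(w))

Run, for example, with "mpirun -n 4 python sync_sgd_allreduce.py". At scale, the cost of the allreduce relative to the local compute is what determines scaling efficiency, which is why the paper's focus is on the communication library and system design rather than on the per-node kernel.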
