
A Novel Co-design Peta-scale Heterogeneous Cluster for Deep Learning Training

Abstract

With advances in deep convolutional neural networks (CNNs), training demands ever-greater computational capacity. At the same time, commodity GPU cards, the commonly used accelerators, are increasingly expensive. Consequently, building an affordable distributed system with powerful computational capacity has become a key factor for large-scale deep learning (DL) training tasks. In this paper, we present an ad hoc distributed heterogeneous system that co-designs the DL algorithm, hardware, and software. Taking into account the characteristics of DL training algorithms and inspired by the Harvard architecture, we design and build a novel distributed system, called Manoa, with a new distributed architecture and connection topology. Manoa consists of 128 Nvidia Tesla P100 GPUs and delivers over 1.2 PFLOPS of single-precision floating-point performance, at a total cost of less than one million dollars. To exploit Manoa, we first propose a job-server parallel software framework, called MiMatrix. In contrast to the parameter-server framework, the center node of MiMatrix, referred to as the job server, undertakes only controlling, scheduling, monitoring, and I/O tasks, without transferring weight data during the model update at each epoch. It thereby avoids the bandwidth bottleneck at the center node of the parameter-server framework. We also propose a new AllReduce algorithm, GPUDirect RDMA-Aware AllReduce (GDRAA), in which both the computation and the handshake messages are O(1) and the number of synchronizations is two, the theoretical minimum. Owing to the dedicated co-design of hardware, software, and algorithm, MiMatrix effectively and efficiently utilizes the computational capacity and bandwidth of Manoa, and experimental results with ResNet-50 on the ImageNet-1K dataset demonstrate state-of-the-art performance.
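The abstract does not give the details of GDRAA, so as background only, the following is a serial Python simulation of the classic two-phase ring AllReduce (reduce-scatter followed by all-gather), the general communication pattern that GPUDirect-aware AllReduce variants optimize. The function name `ring_allreduce` and the list-of-lists worker representation are illustrative assumptions, not the paper's implementation.

```python
def ring_allreduce(buffers):
    """Serial simulation of ring AllReduce over `buffers`, a list of
    equal-length lists (one per simulated worker). After the call,
    every worker holds the element-wise global sum.

    Illustrative sketch only: NOT the paper's GDRAA algorithm."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "vector length must be divisible by worker count"
    c = size // n  # chunk size; each worker "owns" one chunk per phase

    def get(w, k):  # copy chunk k of worker w (list slicing copies)
        return buffers[w][k * c:(k + 1) * c]

    def put(w, k, vals):  # overwrite chunk k of worker w
        buffers[w][k * c:(k + 1) * c] = vals

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the
    # fully reduced chunk (i+1) % n. Snapshot all sends first so each
    # step uses values from before the step, as real ranks would.
    for step in range(n - 1):
        sends = [(w, (w - step) % n, get(w, (w - step) % n))
                 for w in range(n)]
        for w, k, vals in sends:
            dst = (w + 1) % n  # right neighbour on the ring
            put(dst, k, [a + b for a, b in zip(get(dst, k), vals)])

    # Phase 2: all-gather. Each worker circulates its fully reduced
    # chunk around the ring; receivers overwrite rather than add.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, get(w, (w + 1 - step) % n))
                 for w in range(n)]
        for w, k, vals in sends:
            put((w + 1) % n, k, vals)

    return buffers
```

For example, three workers holding `[1]*6`, `[2]*6`, and `[3]*6` all end up with `[6]*6`. Each worker sends and receives only 2(n-1)/n of the data volume per phase, which is why the ring pattern scales well in bandwidth; the abstract's GDRAA additionally claims O(1) computation and handshake messages with only two synchronizations.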
