Ensemble-Compression: A New Method for Parallel Training of Deep Neural Networks
- FedML
In recent years, parallel implementations have been used to speed up the training of deep neural networks (DNNs). Typically, the parameters of the local models are periodically communicated and averaged to obtain a global model until the training curve converges (denoted as MA-DNN). However, since a DNN is a highly non-convex model, the global model obtained by averaging parameters has no performance guarantee over the local models and may even be worse than the average performance of the local models, which slows convergence and degrades the final performance. To tackle this problem, we propose a new parallel training method called Ensemble-Compression (denoted as EC-DNN). Specifically, we propose to aggregate the local models by ensemble, i.e., the outputs of the local models are averaged instead of their parameters. Since the widely used loss functions are convex in the output of the model, the performance of the global model obtained in this way is guaranteed to be at least as good as the average performance of the local models. However, the size of the global model grows after each ensemble and may explode after multiple rounds of ensembling. We therefore conduct model compression after each ensemble to keep the global model the same size as the local models. We conducted experiments on a benchmark dataset, and the results demonstrate that the proposed EC-DNN stably achieves better performance than MA-DNN.
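The convexity argument above can be illustrated with a minimal sketch (illustrative only; the loss, model outputs, and data below are made up, not taken from the paper): when the loss is convex in the model output, Jensen's inequality guarantees that the loss of the averaged outputs (the ensemble) is at most the average of the local models' individual losses.

```python
import numpy as np

def mse_loss(output, target):
    # Mean squared error: convex in `output`.
    return float(np.mean((output - target) ** 2))

rng = np.random.default_rng(0)
target = rng.normal(size=10)

# Outputs of two hypothetical local models on the same inputs.
out_a = target + rng.normal(scale=0.5, size=10)
out_b = target + rng.normal(scale=0.5, size=10)

# EC-DNN-style aggregation: average the *outputs* (ensemble).
ensemble_loss = mse_loss((out_a + out_b) / 2, target)

# MA-DNN-style comparison baseline: average of the individual losses.
avg_local_loss = (mse_loss(out_a, target) + mse_loss(out_b, target)) / 2

# By Jensen's inequality, this holds for any convex loss and any outputs.
assert ensemble_loss <= avg_local_loss
```

Note that no such inequality holds when averaging *parameters* of a non-convex model, which is exactly the gap between MA-DNN and EC-DNN that the abstract describes.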