Provable Acceleration of Neural Net Training via Polyak's Momentum

International Conference on Machine Learning (ICML), 2020
Abstract

Incorporating a so-called "momentum" dynamic in gradient descent methods is widely used in neural net training, as it has been broadly observed that, at least empirically, it often leads to significantly faster convergence. At the same time, there are very few theoretical guarantees in the literature to explain this apparent acceleration effect. In this paper we show that Polyak's momentum, in combination with over-parameterization of the model, helps achieve faster convergence in training a one-layer ReLU network on $n$ examples. We show specifically that gradient descent with Polyak's momentum decreases the initial training error at a rate much faster than that of vanilla gradient descent. We provide a bound for a fixed sample size $n$, and we show that gradient descent with Polyak's momentum converges at an accelerated rate to a small error that is controllable by the number of neurons $m$. Prior work [DZPS19] showed that using vanilla gradient descent, and with a similar method of over-parameterization, the error decays as $(1-\kappa_n)^t$ after $t$ iterations, where $\kappa_n$ is a problem-specific parameter. Our result shows that with the appropriate choice of parameters one has a rate of $(1-\sqrt{\kappa_n})^t$. This work establishes that momentum does indeed speed up neural net training.
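The acceleration described above comes from Polyak's heavy-ball update, $x_{t+1} = x_t - \eta \nabla f(x_t) + \beta (x_t - x_{t-1})$. The sketch below illustrates the effect on a toy ill-conditioned quadratic rather than the paper's ReLU-network setting; the step size and momentum values are the classical tunings for a quadratic with extreme curvatures $\mu$ and $L$, chosen here for illustration and not taken from the paper.

```python
import numpy as np

# Toy ill-conditioned quadratic f(x) = 0.5 x^T A x, with mu = 1, L = 100.
A = np.diag([1.0, 100.0])
mu, L = 1.0, 100.0
grad = lambda x: A @ x

def gd(x0, steps, eta):
    """Vanilla gradient descent: x_{t+1} = x_t - eta * grad f(x_t)."""
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

def heavy_ball(x0, steps, eta, beta):
    """Polyak's momentum: x_{t+1} = x_t - eta * grad f(x_t) + beta * (x_t - x_{t-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(steps):
        x_next = x - eta * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

x0 = np.array([1.0, 1.0])
steps = 100

# Vanilla GD with the standard step size 1/L.
x_gd = gd(x0, steps, eta=1.0 / L)

# Classical heavy-ball tuning for quadratics:
# eta = 4 / (sqrt(L) + sqrt(mu))^2, beta = ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))^2.
eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
x_hb = heavy_ball(x0, steps, eta, beta)

print("GD error:        ", np.linalg.norm(x_gd))
print("Heavy-ball error:", np.linalg.norm(x_hb))
```

On this quadratic the per-step contraction is roughly $1 - 1/\kappa$ for vanilla gradient descent versus $1 - 1/\sqrt{\kappa}$ for heavy ball, mirroring the $(1-\kappa_n)^t$ versus $(1-\sqrt{\kappa_n})^t$ rates the abstract contrasts.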
