
Neural networks have achieved tremendous empirical success in many areas. It has been observed that a randomly initialized neural network trained by first-order methods can reach near-zero training loss, even though its loss landscape is non-convex and non-smooth, yet theoretical explanations for this phenomenon remain scarce. Recently, some attempts have been made to bridge this gap between practice and theory by analyzing the trajectories of gradient descent (GD) and the heavy-ball method (HB) in the over-parameterized regime. In this work, we make further progress by considering Nesterov's accelerated gradient method (NAG) with a constant momentum parameter, and we analyze its convergence for an over-parameterized two-layer fully connected neural network with ReLU activation. Specifically, we prove that the training error of NAG converges to zero at a non-asymptotic linear convergence rate that is determined by the initialization and the architecture of the neural network. In addition, we compare NAG with the existing convergence results for GD and HB. Our theoretical results show that NAG achieves an acceleration over GD and that its convergence rate is comparable to that of HB. Finally, numerical experiments validate our theoretical analysis.
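The setting described in the abstract, NAG with a constant momentum parameter applied to an over-parameterized two-layer ReLU network, can be illustrated with a minimal sketch. This is not the authors' code: the width, step size, momentum value, squared loss, fixed output layer, and unit-norm inputs below are assumptions made for illustration only.

```python
import numpy as np

# Minimal sketch (assumptions, not the paper's setup): a two-layer ReLU network
# f(x) = a^T relu(W x) / sqrt(m), squared loss, first-layer weights trained with
# Nesterov's accelerated gradient (NAG) using a constant momentum parameter.

rng = np.random.default_rng(0)
n, d, m = 64, 10, 2048             # samples, input dim, hidden width (assumed)
eta, beta, steps = 1e-2, 0.9, 500  # step size and constant momentum (assumed)

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs (assumed)
y = rng.normal(size=n)

W = rng.normal(size=(m, d))                     # first-layer weights, trained
a = rng.choice([-1.0, 1.0], size=m)             # output weights, kept fixed (assumed)

def forward(W, X):
    """Network output f(X) = relu(X W^T) a / sqrt(m)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def grad(W, X, y):
    """Gradient of 0.5 * ||f(X) - y||^2 with respect to W."""
    pre = X @ W.T                                # (n, m) pre-activations
    err = forward(W, X) - y                      # (n,) residuals
    # d f_i / d w_r = a_r * 1[pre_{i,r} > 0] * x_i / sqrt(m)
    return ((err[:, None] * (pre > 0) * a[None, :]).T @ X) / np.sqrt(m)

# NAG with constant momentum: evaluate the gradient at the look-ahead point
# W + beta * (W - W_prev), then take a gradient step from there.
W_prev = W.copy()
for t in range(steps):
    lookahead = W + beta * (W - W_prev)
    W_prev, W = W, lookahead - eta * grad(lookahead, X, y)
    if t % 100 == 0:
        loss = 0.5 * np.sum((forward(W, X) - y) ** 2)
        print(f"step {t:4d}  training loss {loss:.6f}")
```

The look-ahead gradient evaluation is what distinguishes this NAG form from the heavy-ball method, which adds the same momentum term but evaluates the gradient at the current iterate.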