Wide neural networks: From non-gaussian random fields at initialization to the NTK geometry of training

Abstract

Recent developments in applications of artificial neural networks with over $n = 10^{14}$ parameters make it extremely important to study the large-$n$ behaviour of such networks. Most works studying wide neural networks have focused on the infinite-width limit $n \to +\infty$ of such networks and have shown that, at initialization, they correspond to Gaussian processes. In this work we will study their behavior for large, but finite $n$. Our main contributions are the following: (1) The computation of the corrections to Gaussianity in terms of an asymptotic series in $n^{-\frac{1}{2}}$. The coefficients in this expansion are determined by the statistics of parameter initialization and by the activation function. (2) Controlling the evolution of the outputs of networks of finite width $n$, during training, by computing deviations from the limiting infinite-width case (in which the network evolves through a linear flow). This improves previous estimates and yields sharper decay rates for the (finite-width) NTK in terms of $n$, valid during the entire training procedure. As a corollary, we also prove that, with arbitrarily high probability, the training of sufficiently wide neural networks converges to a global minimum of the corresponding quadratic loss function. (3) Estimating how the deviations from Gaussianity evolve with training in terms of $n$. In particular, using a certain metric in the space of measures we find that, along training, the resulting measure is within $n^{-\frac{1}{2}}(\log n)^{1+}$ of the time-dependent Gaussian process corresponding to the infinite-width network (which is explicitly given by precomposing the initial Gaussian process with the linear flow corresponding to training in the infinite-width limit).
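
For orientation, the "linear flow" of the infinite-width limit can be sketched in the standard NTK form for gradient flow on a quadratic loss. The notation here is an assumption and is not taken from the abstract: $X$ and $y$ denote the training inputs and targets, $f_t$ the network output at training time $t$, $\Theta_\infty$ the limiting (time-independent) kernel, assumed positive definite on the training set, and $\eta$ the learning rate. The paper's own normalization and conventions may differ.

```latex
% Sketch under the assumptions stated above (standard infinite-width NTK dynamics).
% Quadratic loss: L(f_t) = \tfrac{1}{2}\,\| f_t(X) - y \|^2.
\begin{align*}
  \frac{d}{dt} f_t(X) &= -\eta\, \Theta_\infty(X,X)\,\bigl(f_t(X) - y\bigr), \\
  f_t(X) &= y + e^{-\eta\, \Theta_\infty(X,X)\, t}\,\bigl(f_0(X) - y\bigr), \\
  f_t(x) &= f_0(x) - \Theta_\infty(x,X)\,\Theta_\infty(X,X)^{-1}
            \bigl(I - e^{-\eta\, \Theta_\infty(X,X)\, t}\bigr)\,\bigl(f_0(X) - y\bigr).
\end{align*}
```

In this sketch $f_t$ is an affine function of $f_0$, so a Gaussian process at initialization stays Gaussian at every time $t$, and $f_t(X) \to y$ as $t \to \infty$ when $\Theta_\infty(X,X)$ is positive definite. Contributions (2) and (3) quantify how far a network of finite width $n$ departs from this picture.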
