Large-width asymptotics for ReLU neural networks with α-Stable initializations

16 June 2022
Stefano Favaro
S. Fortini
Stefano Peluchetti
arXiv: 2206.08065
Abstract

There is a recent and growing literature on large-width asymptotic properties of Gaussian neural networks (NNs), namely NNs whose weights are initialized as Gaussian distributions. Two popular problems are: i) the study of the large-width distributions of NNs, which characterizes the infinitely wide limit of a rescaled NN in terms of a Gaussian stochastic process; ii) the study of the large-width training dynamics of NNs, which characterizes the infinitely wide dynamics in terms of a deterministic kernel, referred to as the neural tangent kernel (NTK), and shows that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. In this paper, we consider these problems for α-Stable NNs, namely NNs whose weights are initialized as α-Stable distributions with α ∈ (0, 2]. First, for α-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an α-Stable stochastic process. In contrast to the Gaussian setting, our result shows that the choice of activation function affects the scaling of the NN: to achieve the infinitely wide α-Stable process, the ReLU activation requires an additional logarithmic term in the scaling relative to sub-linear activations. Then, we study the large-width training dynamics of α-Stable ReLU-NNs, characterizing the infinitely wide dynamics in terms of a random kernel, referred to as the α-Stable NTK, and showing that, for a sufficiently large width, gradient descent achieves zero training error at a linear rate. The randomness of the α-Stable NTK is a further difference with respect to the Gaussian setting: within the α-Stable setting, the randomness of the NN at initialization does not vanish in the large-width regime of training.
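The initialization scheme described in the abstract can be sketched in a few lines: i.i.d. α-Stable weights, a ReLU hidden layer, and a width-dependent rescaling of the output. The sketch below is a hypothetical illustration, not the authors' code: it uses scipy.stats.levy_stable for symmetric α-Stable draws, and the (n log n)^(1/α) factor is only an assumed stand-in for the "additional logarithmic term" the abstract mentions without stating its exact form.

```python
import numpy as np
from scipy.stats import levy_stable

def alpha_stable_relu_layer(x, n=4096, alpha=1.5, rng=None):
    """Forward pass of a one-hidden-layer ReLU NN with i.i.d. alpha-Stable weights.

    Illustrative sketch only: the exact rescaling proved in the paper is not
    reproduced here; the (n * log n)**(1/alpha) factor is an assumption based
    on the abstract's remark that ReLU needs an extra logarithmic term,
    whereas sub-linear activations would use the plain n**(1/alpha) scaling.
    """
    rng = np.random.default_rng(rng)
    d = x.shape[-1]
    # Symmetric alpha-Stable weights (skewness beta = 0) for both layers.
    w1 = levy_stable.rvs(alpha, 0.0, size=(d, n), random_state=rng)
    w2 = levy_stable.rvs(alpha, 0.0, size=(n, 1), random_state=rng)
    hidden = np.maximum(x @ w1, 0.0)          # ReLU activation
    scale = (n * np.log(n)) ** (1.0 / alpha)  # assumed ReLU-specific scaling
    return (hidden @ w2) / scale

# Example: output of a wide alpha-Stable ReLU network at a single input.
out = alpha_stable_relu_layer(np.ones((1, 3)), n=10_000, alpha=1.5, rng=0)
```

As the width n grows, the rescaled output is, per the abstract, expected to converge weakly to an α-Stable stochastic process rather than a Gaussian one, which is why a heavier-tailed sampler replaces the usual Gaussian initializer.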
