On the Impacts of the Random Initialization in the Neural Tangent Kernel Theory

8 October 2024
Guhan Chen
Yicheng Li
Qian Lin
Abstract

This paper discusses the impact of random initialization of neural networks in neural tangent kernel (NTK) theory, an aspect ignored by most recent works on the NTK. It is well known that, as the network's width tends to infinity, a neural network with random initialization converges to a Gaussian process $f^{\mathrm{GP}}$ taking values in $L^{2}(\mathcal{X})$, where $\mathcal{X}$ is the domain of the data. In contrast, in order to adopt the traditional theory of kernel regression, most recent works introduce a special mirrored architecture together with a mirrored (random) initialization that makes the network's output identically zero at initialization. It therefore remains an open question whether the conventional setting and the mirrored initialization lead wide neural networks to exhibit different generalization capabilities. In this paper, we first show that the training dynamics of the gradient flow of a neural network with random initialization converge uniformly to those of the corresponding NTK regression initialized at $f^{\mathrm{GP}}$. We then show that $\mathbf{P}(f^{\mathrm{GP}} \in [\mathcal{H}^{\mathrm{NT}}]^{s}) = 1$ for any $s < \frac{3}{d+1}$ and $\mathbf{P}(f^{\mathrm{GP}} \in [\mathcal{H}^{\mathrm{NT}}]^{s}) = 0$ for any $s \geq \frac{3}{d+1}$, where $[\mathcal{H}^{\mathrm{NT}}]^{s}$ is the real interpolation space of the RKHS $\mathcal{H}^{\mathrm{NT}}$ associated with the NTK. Consequently, the generalization error of the wide neural network trained by gradient descent is $\Omega(n^{-\frac{3}{d+3}})$, so it still suffers from the curse of dimensionality. On the one hand, this result highlights the benefits of the mirrored initialization; on the other hand, it implies that NTK theory may not fully explain the superior performance of neural networks.
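The exponent $\frac{3}{d+3}$ can be read off from the standard kernel-regression rate $n^{-\frac{s\beta}{s\beta+1}}$ for a target of smoothness $s$ under polynomial eigenvalue decay $\lambda_i \asymp i^{-\beta}$; the decay exponent $\beta = \frac{d+1}{d}$ for the NTK on the sphere is an assumption taken from the NTK literature, not stated in the abstract. A sketch of the arithmetic:

% Sketch only: assumes the NTK eigenvalue decay \lambda_i \asymp i^{-(d+1)/d} on \mathbb{S}^{d}
% and the standard kernel-regression rate exponent s\beta/(s\beta+1).
\begin{align*}
  s = \frac{3}{d+1}, \qquad \beta = \frac{d+1}{d}
  \quad\Longrightarrow\quad
  s\beta = \frac{3}{d},
  \qquad
  n^{-\frac{s\beta}{s\beta+1}} = n^{-\frac{3/d}{3/d+1}} = n^{-\frac{3}{d+3}}.
\end{align*}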

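To make the contrast between the two initializations concrete, here is a minimal NumPy sketch, not the paper's exact construction: a two-layer ReLU network in the NTK parameterization has a non-degenerate output at random initialization, approximating a draw from $f^{\mathrm{GP}}$ as the width grows, whereas a mirrored construction that subtracts two identically initialized copies outputs exactly zero at initialization. The architecture and the specific mirroring used below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def two_layer_relu(x, W, a):
    # Two-layer ReLU network in the NTK parameterization:
    # f(x) = (1/sqrt(m)) * sum_j a_j * max(<w_j, x>, 0)
    m = W.shape[0]
    return np.maximum(x @ W.T, 0.0) @ a / np.sqrt(m)

d, m, n = 5, 10_000, 8                          # input dim, width, test points
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # place inputs on the sphere

# Standard random initialization: the output is non-degenerate and, as the
# width m grows, behaves like a sample of the Gaussian process f^GP.
W, a = rng.standard_normal((m, d)), rng.standard_normal(m)
f_random = two_layer_relu(X, W, a)

# One common "mirrored" construction: subtract two copies sharing the same
# random weights (scaled by 1/sqrt(2)), so the output is identically zero at
# initialization while the NTK at initialization is essentially unchanged.
f_mirrored = (two_layer_relu(X, W, a) - two_layer_relu(X, W.copy(), a.copy())) / np.sqrt(2)

print("random init   :", np.round(f_random, 3))    # O(1) fluctuations
print("mirrored init :", np.round(f_mirrored, 3))  # exactly zero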
View on arXiv: 2410.05626