
The Spectral Dimension of NTKs is Constant: A Theory of Implicit Regularization, Finite-Width Stability, and Scalable Estimation

Abstract

Modern deep networks are heavily overparameterized yet often generalize well, suggesting a form of low intrinsic complexity not reflected by parameter counts. We study this complexity at initialization through the effective rank of the Neural Tangent Kernel (NTK) Gram matrix, $r_{\text{eff}}(K) = (\text{tr}(K))^2 / \|K\|_F^2$. For i.i.d. data and the infinite-width NTK $k$, we prove a constant-limit law $\lim_{n\to\infty} \mathbb{E}[r_{\text{eff}}(K_n)] = \mathbb{E}[k(x, x)]^2 / \mathbb{E}[k(x, x')^2] =: r_\infty$, with sub-Gaussian concentration. We further establish finite-width stability: if the finite-width NTK deviates in operator norm by $O_p(m^{-1/2})$ (width $m$), then $r_{\text{eff}}$ changes by $O_p(m^{-1/2})$. We design a scalable estimator using random output probes and a CountSketch of parameter Jacobians, and prove conditional unbiasedness and consistency with explicit variance bounds. On CIFAR-10 with ResNet-20/56 (widths 16/32) across $n \in \{10^3, 5\times10^3, 10^4, 2.5\times10^4, 5\times10^4\}$, we observe $r_{\text{eff}} \approx 1.0\text{--}1.3$ and slopes $\approx 0$ in $n$, consistent with the theory, and the kernel-moment prediction closely matches the fitted constants.
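As a concrete illustration of the quantities in the abstract, the snippet below is a minimal sketch (not the authors' code: the toy Jacobian matrix `G`, the sketch width `d`, and all function names are assumptions made here). It computes the effective rank $r_{\text{eff}}(K) = (\text{tr}(K))^2/\|K\|_F^2$ and a plug-in estimate of the kernel-moment prediction $r_\infty = \mathbb{E}[k(x,x)]^2/\mathbb{E}[k(x,x')^2]$ directly from a Gram matrix, and applies a CountSketch-style compression to per-example Jacobian rows, whose sketched inner products are unbiased estimates of the Gram entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_rank(K):
    """Effective rank r_eff(K) = tr(K)^2 / ||K||_F^2 of a Gram matrix."""
    return np.trace(K) ** 2 / np.sum(K ** 2)

def moment_prediction(K):
    """Plug-in estimate of r_inf = E[k(x,x)]^2 / E[k(x,x')^2]:
    squared mean of the diagonal over the mean squared off-diagonal entry."""
    n = K.shape[0]
    off_diag = K[~np.eye(n, dtype=bool)]
    return np.mean(np.diag(K)) ** 2 / np.mean(off_diag ** 2)

def countsketch_rows(G, d, rng):
    """CountSketch each row of G (shape n x p) down to dimension d.
    Sketched inner products are unbiased: E[<S g_i, S g_j>] = <g_i, g_j>."""
    n, p = G.shape
    buckets = rng.integers(0, d, size=p)      # hash each coordinate to a bucket
    signs = rng.choice([-1.0, 1.0], size=p)   # independent Rademacher signs
    S = np.zeros((d, n))
    np.add.at(S, buckets, (G * signs).T)      # scatter-add signed coordinates
    return S.T

# Toy stand-in for (probed) per-example parameter Jacobians: rows share a
# dominant common direction plus small noise, so K = G G^T is nearly rank one
# and r_eff should sit close to 1, in the regime the abstract reports.
n, p, d = 500, 5000, 512
u = rng.standard_normal(p)
u /= np.linalg.norm(u)
c = 1.0 + 0.3 * rng.standard_normal(n)
G = np.outer(c, u) + 0.2 * rng.standard_normal((n, p)) / np.sqrt(p)

K = G @ G.T                                   # exact Gram matrix
G_s = countsketch_rows(G, d, rng)
K_s = G_s @ G_s.T                             # entrywise-unbiased sketched Gram

print("exact r_eff      :", effective_rank(K))
print("sketched r_eff   :", effective_rank(K_s))
print("moment prediction:", moment_prediction(K))
```

In the paper's setting, each row of `G` would be a parameter Jacobian contracted with a random output probe; the synthetic rows with a shared dominant direction are used here only so the snippet runs on its own, and are not data or results from the paper.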

7 pages (main) + 1 page bibliography, 1 figure, 2 tables