
Sharp Generalization for Nonparametric Regression in Interpolation Space by Over-Parameterized Neural Networks Trained with Preconditioned Gradient Descent and Early Stopping

Main: 48 Pages
3 Figures
Bibliography: 6 Pages
1 Table
Abstract

In this paper, we study nonparametric regression by an over-parameterized two-layer neural network trained with algorithmic guarantees. We consider the setting where the training features are drawn uniformly from the unit sphere in $\mathbb{R}^d$ and the target function lies in an interpolation space commonly studied in statistical learning theory. We demonstrate that training the neural network with a novel Preconditioned Gradient Descent (PGD) algorithm, equipped with early stopping, achieves a sharp regression rate of $\mathcal{O}(n^{-\frac{2\alpha s'}{2\alpha s'+1}})$ when the target function lies in the interpolation space $[\mathcal{H}_K]^{s'}$ with $s' \ge 3$. This rate is even sharper than the currently known nearly-optimal rate of $\mathcal{O}(n^{-\frac{2\alpha s'}{2\alpha s'+1}})\log^2(1/\delta)$~\citep{Li2024-edr-general-domain}, where $n$ is the size of the training data and $\delta \in (0,1)$ is a small failure probability. It is also sharper than the standard kernel regression rate of $\mathcal{O}(n^{-\frac{2\alpha}{2\alpha+1}})$ obtained in the regular Neural Tangent Kernel (NTK) regime when the network is trained with vanilla gradient descent (GD), where $2\alpha = d/(d-1)$. Our analysis is based on two key technical contributions. First, we present a principled decomposition of the network output at each PGD step into a function in the reproducing kernel Hilbert space (RKHS) of a newly induced integral kernel and a residual function with small $L^{\infty}$-norm. Second, leveraging this decomposition, we apply local Rademacher complexity theory to tightly control the complexity of the function class comprising all the neural network functions obtained along the PGD iterates. Our results further suggest that PGD enables the neural network to escape the linear NTK regime and achieve improved generalization.
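To make the training procedure described above concrete, the following is a minimal, illustrative sketch of preconditioned gradient descent with early stopping for an over-parameterized two-layer network on inputs from the unit sphere. It is not the paper's algorithm: the preconditioner `P` (a regularized inverse input covariance), the validation-based stopping rule, and all names (`predict`, `grad_W`, the toy target) are assumptions made only for illustration; the paper's actual PGD construction and stopping criterion are given in the full text.

```python
# Illustrative sketch only: generic preconditioned GD with early stopping
# for an over-parameterized two-layer ReLU network (not the paper's PGD).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data on the unit sphere in R^d, matching the abstract's setting.
n, d, m = 200, 5, 2048                      # samples, input dim, hidden width
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)   # toy target + noise

# Held-out split used only for the early-stopping rule.
n_tr = n // 2
Xtr, ytr, Xval, yval = X[:n_tr], y[:n_tr], X[n_tr:], y[n_tr:]

# Two-layer network f(x) = (1/sqrt(m)) * a^T relu(W x); only W is trained.
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

def predict(W, Xb):
    return (np.maximum(Xb @ W.T, 0.0) @ a) / np.sqrt(m)

def grad_W(W, Xb, yb):
    H = Xb @ W.T                              # pre-activations, shape (n, m)
    R = predict(W, Xb) - yb                   # residuals, shape (n,)
    return ((H > 0) * (R[:, None] * a[None, :] / np.sqrt(m))).T @ Xb / len(yb)

# Illustrative preconditioner: regularized inverse of the empirical input
# covariance, applied on the input side of the gradient.
C = Xtr.T @ Xtr / n_tr + 1e-3 * np.eye(d)
P = np.linalg.inv(C)

lr, patience, best_val, best_W, bad = 1.0, 20, np.inf, W.copy(), 0
for t in range(2000):
    W -= lr * grad_W(W, Xtr, ytr) @ P         # preconditioned GD step
    val = np.mean((predict(W, Xval) - yval) ** 2)
    if val < best_val - 1e-8:
        best_val, best_W, bad = val, W.copy(), 0
    else:
        bad += 1
        if bad >= patience:                    # early stopping
            break

W = best_W                                    # keep the early-stopped iterate
print(f"stopped at iteration {t}, validation MSE {best_val:.4f}")
```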
