
Ultra-fast feature learning for the training of two-layer neural networks in the two-timescale regime

Main: 38 pages
17 figures
Bibliography: 6 pages
Appendix: 7 pages
Abstract

We study the convergence of gradient methods for the training of mean-field single-hidden-layer neural networks with square loss. For this high-dimensional and non-convex optimization problem, most known convergence results are either qualitative or rely on a neural tangent kernel analysis in which the nonlinear representations of the data are fixed. Exploiting the fact that this problem belongs to the class of separable nonlinear least squares problems, we consider a Variable Projection (VarPro), or two-timescale, learning algorithm, which eliminates the linear variables and reduces the learning problem to the training of the nonlinear features. In a teacher-student scenario, we show that such a strategy enables provable convergence rates for the sampling of a teacher feature distribution. Precisely, in the limit where the regularization strength vanishes, we show that the dynamics of the feature distribution correspond to a weighted ultra-fast diffusion equation. Recent results on the asymptotic behavior of such PDEs then give quantitative guarantees for the convergence of the learned feature distribution.
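
The variable-projection idea in the abstract can be illustrated with a small finite-width sketch. This is not the paper's implementation: the tanh activation, the ridge penalty `lam` on the outer weights, the 1/m mean-field scaling, and every function name below are assumptions chosen for illustration. For fixed features `W`, the outer (linear) weights are obtained in closed form from a regularized least-squares problem, and gradient descent then acts only on the reduced objective in `W`.

```python
import numpy as np

def features(W, X):
    # Hypothetical feature map: Phi[n, i] = tanh(<w_i, x_n>) / m (mean-field scaling).
    return np.tanh(X @ W.T) / W.shape[0]

def solve_outer(W, X, y, lam):
    # Eliminate the linear variables: minimize over u
    #   0.5 * mean((Phi u - y)^2) + 0.5 * lam * ||u||^2
    # which gives the normal equations (Phi^T Phi / n + lam I) u = Phi^T y / n.
    Phi = features(W, X)
    n, m = Phi.shape
    A = Phi.T @ Phi / n + lam * np.eye(m)
    return np.linalg.solve(A, Phi.T @ y / n)

def reduced_loss(W, X, y, lam):
    # Objective in the features only, with the outer weights at their optimum.
    u = solve_outer(W, X, y, lam)
    r = features(W, X) @ u - y
    return 0.5 * np.mean(r ** 2) + 0.5 * lam * np.sum(u ** 2)

def reduced_grad(W, X, y, lam):
    # Since u is optimal for the current W, the gradient of the reduced loss
    # equals the partial derivative of the full loss in W at fixed u.
    u = solve_outer(W, X, y, lam)
    n, m = X.shape[0], W.shape[0]
    Z = X @ W.T                                   # pre-activations, shape (n, m)
    r = np.tanh(Z) @ u / m - y                    # residuals, shape (n,)
    G = r[:, None] * (1.0 - np.tanh(Z) ** 2) * u[None, :] / m
    return G.T @ X / n                            # shape (m, d)

def varpro_step(W, X, y, lam, lr):
    # One gradient step on the features; the outer weights are re-solved
    # implicitly inside reduced_grad.
    return W - lr * reduced_grad(W, X, y, lam)
```

In a teacher-student experiment, `y` would be produced by a teacher network of the same form, and `varpro_step` would be iterated from a random initialization of `W`. The regime studied in the abstract corresponds to letting `lam` tend to zero, which this finite-sample sketch does not attempt to reproduce.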
