v2 (latest)

Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit

Main: 17 pages, 3 figures; Bibliography: 5 pages; Appendix: 63 pages
Abstract

In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we study gradient-descent learning of a general Gaussian multi-index model $f(\boldsymbol{x}) = g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U} \in \mathbb{R}^{r \times d}$, the canonical setup for studying representation learning. We prove that, under generic non-degeneracy assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent agnostically learns the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. Both the sample and time complexity match the information-theoretic limit up to leading order and are therefore optimal. The proof shows that during the first stage of gradient descent the inner weights perform a power-iteration process: this process implicitly mimics a spectral start on the whole span of the hidden subspace, eventually suppressing finite-sample noise and recovering that span. Surprisingly, this indicates that optimal results can be achieved only if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates that neural networks can learn hierarchical functions efficiently with respect to both samples and time.
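As a rough illustration of the training scheme described above, the sketch below runs layer-wise gradient descent on a toy Gaussian multi-index instance: the first layer is trained by plain gradient descent for multiple steps (echoing the more-than-$\mathcal{O}(1)$-step requirement), then the second layer is fit in closed form. The dimensions, link function `g`, learning rate, and ridge fit are all illustrative assumptions, not the paper's actual algorithm, hyperparameters, or rates.

```python
# Minimal sketch (assumed toy setup, not the paper's algorithm):
# layer-wise gradient descent on a two-layer ReLU network for a
# Gaussian multi-index target f(x) = g(Ux).
import numpy as np

rng = np.random.default_rng(0)
d, r, m, n = 64, 2, 256, 4096   # input dim, hidden subspace dim, width, samples

# Hidden subspace U with orthonormal rows, and an illustrative link function g.
U = np.linalg.qr(rng.standard_normal((d, r)))[0].T          # r x d
g = lambda z: np.tanh(z[:, 0]) + 0.5 * z[:, 1] ** 2         # assumed toy link

X = rng.standard_normal((n, d))
y = g(X @ U.T)

# Two-layer network f(x) = a . relu(W x), random initialization.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

relu = lambda z: np.maximum(z, 0.0)

# Stage 1: gradient descent on the first layer only (second layer frozen),
# run for many steps rather than O(1).
lr, steps = 0.2, 50
for _ in range(steps):
    H = X @ W.T                          # n x m pre-activations
    resid = relu(H) @ a - y              # prediction residuals
    # Gradient of the mean squared loss with respect to W.
    grad_W = ((resid[:, None] * (H > 0) * a[None, :]).T @ X) * (2.0 / n)
    W -= lr * grad_W

# Stage 2: fit the second layer in closed form (ridge regression on features).
Phi = relu(X @ W.T)
a = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(m), Phi.T @ y)

# Diagnostics: alignment of first-layer weights with span(U), and test error.
P = U.T @ U                              # projector onto the hidden subspace
alignment = np.linalg.norm(W @ P) / np.linalg.norm(W)
X_test = rng.standard_normal((2000, d))
test_mse = np.mean((relu(X_test @ W.T) @ a - g(X_test @ U.T)) ** 2)
print(f"subspace alignment: {alignment:.3f}, test MSE: {test_mse:.4f}")
```

The alignment diagnostic tracks the phenomenon the abstract highlights: as the first layer takes more gradient steps, the weight matrix should concentrate its mass on span(U), in the spirit of the implicit power-iteration argument; the sketch makes no claim about matching the paper's $\widetilde{\mathcal{O}}(d)$ sample or $\widetilde{\mathcal{O}}(d^2)$ time guarantees.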
