Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index Models

Abstract

We focus on the task of learning a single index model $\sigma(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the sample complexity of learning $w^\star$ is governed by the information exponent $k^\star$ of the link function $\sigma$, which is defined as the index of the first nonzero Hermite coefficient of $\sigma$. Ben Arous et al. (2021) showed that $n \gtrsim d^{k^\star - 1}$ samples suffice for learning $w^\star$ and that this is tight for online SGD. However, the CSQ lower bound for gradient-based methods only shows that $n \gtrsim d^{k^\star/2}$ samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns $w^\star$ with $n \gtrsim d^{k^\star/2}$ samples. We also draw connections to statistical analyses of tensor PCA and to the implicit regularization effects of minibatch SGD on empirical losses.
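
To make the central definition concrete, the following Python sketch (an illustration, not code from the paper) estimates the Hermite coefficients of a link function $\sigma$ by Gauss-Hermite quadrature and reports the information exponent, i.e. the index of the first nonzero coefficient. The function names, node count, and numerical tolerance are assumptions chosen for the example.

```python
# Illustrative sketch (not from the paper): numerically estimate the
# information exponent k* of a link function sigma, i.e. the index of its
# first nonzero Hermite coefficient under the standard Gaussian measure.
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coefficient(sigma, k, n_nodes=200):
    """Approximate E_{z ~ N(0,1)}[sigma(z) * He_k(z)] / k! by quadrature."""
    nodes, weights = hermegauss(n_nodes)       # quadrature for weight exp(-z^2/2)
    weights = weights / np.sqrt(2 * np.pi)     # renormalize to the N(0,1) density
    basis = np.zeros(k + 1)
    basis[k] = 1.0                             # coefficient vector selecting He_k
    He_k = hermeval(nodes, basis)              # probabilists' Hermite polynomial He_k
    return float(np.sum(weights * sigma(nodes) * He_k)) / math.factorial(k)

def information_exponent(sigma, max_k=10, tol=1e-8):
    """Smallest k >= 1 whose Hermite coefficient is (numerically) nonzero."""
    for k in range(1, max_k + 1):
        if abs(hermite_coefficient(sigma, k)) > tol:
            return k
    return None

# Example: sigma(z) = z^3 - 3z = He_3(z) has information exponent k* = 3, so the
# online-SGD rate of Ben Arous et al. needs n ≳ d^2 samples, while the
# smoothed-loss rate proved in this paper needs only n ≳ d^{3/2}.
print(information_exponent(lambda z: z**3 - 3 * z))   # prints 3
```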
