
Universality of empirical risk minimization

Abstract

Consider supervised learning from i.i.d. samples $\{{\boldsymbol x}_i, y_i\}_{i\le n}$ where ${\boldsymbol x}_i \in \mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors ${\boldsymbol \theta}_1, \ldots, {\boldsymbol \theta}_{\mathsf k} \in \mathbb{R}^p$, and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n, p \to \infty$ with $n/p = \Theta(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed, to leading order, under a simpler model in which the feature vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol g}_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors ${\boldsymbol x}_i$ that are produced by randomized featurization maps. In particular we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).
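
As an illustration of the Gaussian equivalence described in the abstract, the following minimal Python sketch (not from the paper) compares the ridge-regularized training error computed on ReLU random features $x_i = \mathrm{relu}(W z_i)$ with the same quantity computed on Gaussian features $g_i$ whose mean and covariance are matched to those of the random features. The dimensions, the ReLU activation, the sign label model, and the ridge penalty are all illustrative choices rather than the paper's exact setup; under the proportional asymptotics the two printed training errors should be close.

```python
# Illustrative sketch only: compares the ridge ERM training error on ReLU random
# features with the one obtained after replacing the features by Gaussian vectors
# with matched mean and covariance. Dimensions, activation, label model, and the
# ridge penalty lam are arbitrary choices, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 900, 300, 200   # samples, feature dimension, latent dimension (n/p = Theta(1))
lam = 0.1                 # ridge regularization strength

W = rng.standard_normal((p, d)) / np.sqrt(d)   # random first-layer weights
beta = rng.standard_normal(p) / np.sqrt(p)     # labels depend on one projection (k = 1)

# Estimate the mean and covariance of the random features from an auxiliary sample,
# then build a Cholesky factor to sample the Gaussian-equivalent features.
Z_aux = rng.standard_normal((20000, d))
X_aux = np.maximum(Z_aux @ W.T, 0.0)
mu, cov = X_aux.mean(axis=0), np.cov(X_aux, rowvar=False)
L = np.linalg.cholesky(cov + 1e-8 * np.eye(p))

def sample(gaussian_equivalent: bool):
    """Draw n (feature, label) pairs; labels depend on features through <beta, x_i>."""
    if gaussian_equivalent:
        X = mu + rng.standard_normal((n, p)) @ L.T
    else:
        Z = rng.standard_normal((n, d))
        X = np.maximum(Z @ W.T, 0.0)
    y = np.sign(X @ beta)
    return X, y

def ridge_train_error(X, y):
    """Minimum of (1/n) * ||y - X theta||^2 + lam * ||theta||^2 over theta."""
    theta = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
    return np.mean((y - X @ theta) ** 2) + lam * np.sum(theta ** 2)

X_rf, y_rf = sample(gaussian_equivalent=False)
X_g, y_g = sample(gaussian_equivalent=True)
print("training error, ReLU random features:", ridge_train_error(X_rf, y_rf))
print("training error, Gaussian equivalent :", ridge_train_error(X_g, y_g))
```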
