
Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back

Neural Information Processing Systems (NeurIPS), 2016
Abstract

In stochastic convex optimization the goal is to minimize a convex function $F(x) \doteq \mathbf{E}_{\mathbf{f}\sim D}[\mathbf{f}(x)]$ over a convex set $\mathcal{K} \subset \mathbb{R}^d$, where $D$ is some unknown distribution and each $\mathbf{f}(\cdot)$ in the support of $D$ is convex over $\mathcal{K}$. The optimization is commonly based on i.i.d. samples $\mathbf{f}^1, \mathbf{f}^2, \ldots, \mathbf{f}^n$ from $D$. A standard approach to such problems is empirical risk minimization (ERM), which optimizes $F_S(x) \doteq \frac{1}{n}\sum_{i \leq n} \mathbf{f}^i(x)$. Here we consider the question of how many samples are necessary for ERM to succeed and the closely related question of uniform convergence of $F_S$ to $F$ over $\mathcal{K}$. We demonstrate that in the standard $\ell_p/\ell_q$ setting of Lipschitz-bounded functions over a $\mathcal{K}$ of bounded radius, ERM requires a sample size that scales linearly with the dimension $d$. This nearly matches standard upper bounds and improves on the $\Omega(\log d)$ dependence proved for the $\ell_2/\ell_2$ setting by Shalev-Shwartz et al. (2009). In stark contrast, these problems can be solved using a dimension-independent number of samples in the $\ell_2/\ell_2$ setting and with only $\log d$ dependence in the $\ell_1/\ell_\infty$ setting using other approaches. We also demonstrate that for a more general class of range-bounded (but not Lipschitz-bounded) stochastic convex programs, an even stronger gap appears already in dimension 2.
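As a concrete illustration of the setup (not taken from the paper), the following minimal Python sketch runs ERM by projected subgradient descent on i.i.d. samples from a hypothetical family of convex Lipschitz losses $\mathbf{f}^i(x) = |\langle v_i, x\rangle - b_i|$ over the unit $\ell_2$ ball $\mathcal{K}$; the loss family, step schedule, and all constants are illustrative assumptions.

```python
# Illustrative sketch only: ERM for stochastic convex optimization via
# projected subgradient descent. The losses f^i(x) = |<v_i, x> - b_i|
# and all parameter choices below are assumptions for demonstration,
# not the construction used in the paper.
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200                      # dimension and sample size (arbitrary)
V = rng.normal(size=(n, d))         # v_1, ..., v_n drawn i.i.d.
b = rng.normal(size=n)              # offsets b_1, ..., b_n

def project_l2_ball(x, radius=1.0):
    """Project x onto the l2 ball of the given radius (the set K)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def F_S(x):
    """Empirical risk F_S(x) = (1/n) sum_i |<v_i, x> - b_i|."""
    return np.mean(np.abs(V @ x - b))

def erm_subgradient_descent(steps=2000):
    """Minimize F_S over the unit l2 ball with a 1/sqrt(t) step size."""
    x = np.zeros(d)
    for t in range(1, steps + 1):
        g = (np.sign(V @ x - b) @ V) / n   # a subgradient of F_S at x
        x = project_l2_ball(x - g / np.sqrt(t))
    return x

x_hat = erm_subgradient_descent()
print("empirical risk F_S(x_hat):", F_S(x_hat))
```

The empirical minimizer $\hat{x}$ found this way achieves small $F_S$ by construction; the paper's question is when small empirical risk also implies small population risk $F(\hat{x})$, which is exactly where the dimension dependence enters.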
