The Sample Complexity of Gradient Descent in Stochastic Convex Optimization

Main: 14 pages · 1 figure · Bibliography: 2 pages · Appendix: 5 pages
Abstract

We analyze the sample complexity of full-batch Gradient Descent (GD) in the setting of non-smooth Stochastic Convex Optimization. We show that the generalization error of GD, with (minimax) optimal choice of hyper-parameters, can be $\tilde{\Theta}(d/m + 1/\sqrt{m})$, where $d$ is the dimension and $m$ is the sample size. This matches the sample complexity of \emph{worst-case} empirical risk minimizers, meaning that, in contrast with other algorithms, GD has no advantage over naive ERMs. Our result follows from a new generalization bound that depends on the dimension as well as on the learning rate and the number of iterations. This bound also shows that, for general hyper-parameters, when the dimension is strictly larger than the number of samples, $T = \Omega(1/\epsilon^4)$ iterations are necessary to avoid overfitting. This resolves an open problem raised by \citet*{schliserman2024dimension, amir2021sgd}, and improves over previous lower bounds, which showed that the sample size must be at least the square root of the dimension.
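For reference, a minimal sketch of the standard full-batch (sub)gradient descent setup in non-smooth SCO that the abstract refers to; the exact formulation below (empirical risk $\hat{F}$, step size $\eta$, averaged iterate $\bar{w}_T$) is the conventional one and is assumed here rather than spelled out in the abstract.

```latex
% Hedged sketch: conventional full-batch GD in non-smooth stochastic convex optimization.
% F(w) = E_{z ~ D}[f(w, z)] is the population risk; \hat{F} is the empirical risk over
% m i.i.d. samples z_1, ..., z_m; \eta is the learning rate and T the number of iterations.
\[
  \hat{F}(w) = \frac{1}{m}\sum_{i=1}^{m} f(w, z_i), \qquad
  w_{t+1} = w_t - \eta\, g_t, \quad g_t \in \partial \hat{F}(w_t), \quad t = 0, \dots, T-1,
\]
\[
  \text{output } \bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t, \qquad
  \text{generalization error} = \mathbb{E}\!\left[ F(\bar{w}_T) - \hat{F}(\bar{w}_T) \right].
\]
```

Under this reading, the hyper-parameters referred to in the abstract are the learning rate $\eta$ and the iteration count $T$.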