
Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization

Abstract

We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm, which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $\eta = \Theta(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $\Omega(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $\Theta(1/(\eta T) + \eta \sqrt{T})$, where $T$ is the total number of steps. These results reveal a certain phase transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a lower bound of $\Omega(\eta \sqrt{n})$ on the generalization gap of one-pass SGD in dimension $d = \widetilde{O}(n)$, improving on recent results of Koren et al. (2022) and Schliserman et al. (2024).
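
To make the setting concrete, below is a minimal sketch (not taken from the paper) of multi-pass, epoch-based SGD with the step size $\eta = \Theta(1/\sqrt{n})$ discussed in the abstract. The toy absolute-deviation loss, function names, and parameter choices are illustrative assumptions, not the authors' construction.

# Minimal sketch (assumed, not the paper's construction): multi-pass SGD on a
# convex non-smooth objective, one subgradient step per example, several passes
# over the same sample, returning the averaged iterate.
import numpy as np

def multi_pass_sgd(sample, loss_grad, epochs, eta, x0):
    """Run `epochs` full passes over `sample` (shuffled without replacement
    within each pass) and return the average of all iterates."""
    x = x0.copy()
    iterates = []
    for _ in range(epochs):
        for i in np.random.permutation(len(sample)):
            x = x - eta * loss_grad(x, sample[i])  # subgradient step
            iterates.append(x.copy())
    return np.mean(iterates, axis=0)

# Toy usage with the convex, non-smooth loss f(x; z) = |x - z|.
rng = np.random.default_rng(0)
n = 100
sample = rng.normal(size=(n, 1))
eta = 1.0 / np.sqrt(n)  # the Theta(1/sqrt(n)) step size from the abstract
x_bar = multi_pass_sgd(sample,
                       loss_grad=lambda x, z: np.sign(x - z),
                       epochs=2, eta=eta, x0=np.zeros(1))

The abstract's point is about exactly this regime: with this step size, one pass (epochs=1) attains the optimal rate, while additional passes over the same sample can already overfit in the non-smooth case.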

@article{vansover-hager2025_2505.08306,
  title={Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization},
  author={Shira Vansover-Hager and Tomer Koren and Roi Livni},
  journal={arXiv preprint arXiv:2505.08306},
  year={2025}
}