What Happens after SGD Reaches Zero Loss? --A Mathematical Framework

13 October 2021
Zhiyuan Li
Tianhao Wang
Sanjeev Arora
MLT
arXiv:2110.06914
Abstract

Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In such a regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of the loss, $\mathrm{tr}[\nabla^2 L]$. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows in principle a complete characterization of the regularization effect of SGD around such a manifold -- i.e., the "implicit bias" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) allowing arbitrary noise covariance. As an application, we show that with arbitrarily large initialization, label noise SGD can always escape the kernel regime and requires only $O(\kappa \ln d)$ samples for learning a $\kappa$-sparse overparametrized linear model in $\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $\Omega(d)$ samples. This upper bound is minimax optimal and improves the previous $\tilde{O}(\kappa^2)$ upper bound (HaoChen et al., 2020).
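
To make the sparse-recovery application concrete, below is a minimal illustrative sketch (not the paper's code) of label-noise SGD on a quadratically overparametrized linear model in the spirit of Woodworth et al. (2020), with the regressor parametrized as beta = w_plus**2 - w_minus**2. The dimensions, noise level, learning rate, and initialization scale are assumptions chosen only for readability.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (assumed, not from the paper).
d, n, k = 100, 40, 3                 # ambient dimension, samples, sparsity
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:k] = 1.0                  # k-sparse ground-truth regressor
y = X @ beta_star

# Quadratic overparametrization: beta(w) = w_plus**2 - w_minus**2.
alpha = 1.0                          # initialization scale (assumed)
w_plus = np.full(d, alpha)
w_minus = np.full(d, alpha)          # beta starts at exactly 0

eta, sigma, steps = 1e-3, 0.5, 200_000   # learning rate, label-noise std, iterations

for t in range(steps):
    i = rng.integers(n)
    y_noisy = y[i] + sigma * rng.standard_normal()       # fresh label noise each step
    resid = X[i] @ (w_plus**2 - w_minus**2) - y_noisy
    # SGD step on 0.5 * resid**2; chain rule through beta = w_plus**2 - w_minus**2.
    w_plus -= eta * resid * X[i] * (2 * w_plus)
    w_minus -= eta * resid * X[i] * (-2 * w_minus)

beta_hat = w_plus**2 - w_minus**2
print("recovery error:", np.linalg.norm(beta_hat - beta_star))
print("mass off support:", np.abs(beta_hat[k:]).sum())

In the abstract's terms, the fresh label noise keeps perturbing the iterate near the zero-loss manifold in a way that reduces the sharpness $\mathrm{tr}[\nabla^2 L]$, which for this parameterization favors sparse $\beta$; the printed off-support mass is expected to shrink as the number of steps grows, though this toy run is only a sketch of the phenomenon, not a reproduction of the paper's results.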
