Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods

21 May 2023
Junchi Yang
Xiang Li
Ilyas Fatkhullin
Niao He
arXiv:2305.12475 · PDF · HTML
Abstract

The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $\eta_t = \eta/\sqrt{t}$ relies on a well-tuned $\eta$ depending on problem parameters such as the Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with arbitrary $\eta > 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $\widetilde{O}(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives. Unfortunately, it comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods – Normalized SGD (NSGD), AMSGrad, and AdaGrad – unveiling their power in preventing such exponential dependence in the absence of information about the smoothness parameter and boundedness of the stochastic gradients. Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients.
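To make the contrast concrete, below is a minimal sketch (not code from the paper) comparing untuned SGD with the $\eta/\sqrt{t}$ schedule against Normalized SGD on a toy quadratic with smoothness constant $L$; the objective, noise model, and all constants are illustrative assumptions. NSGD normalizes the gradient, so its step length does not scale with the unknown $L$, whereas untuned SGD with an arbitrary $\eta$ can let the early iterates grow by a factor exponential in $L$ before the stepsize decays enough.

```python
# A minimal sketch (not the paper's code) contrasting untuned SGD with the
# eta_t = eta / sqrt(t) schedule against Normalized SGD (NSGD) on a toy
# smooth quadratic f(x) = (L/2) * ||x||^2. The objective, noise model, and
# constants below are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
L = 10.0      # smoothness constant (treated as unknown to the optimizer)
sigma = 0.1   # stochastic-gradient noise level (assumed)
T = 1000      # number of iterations
eta = 1.0     # arbitrary "untuned" base stepsize, not adapted to L

def stochastic_grad(x):
    """Exact gradient L*x plus Gaussian noise."""
    return L * x + sigma * rng.standard_normal(x.shape)

def run(x0, normalize):
    """Run T steps; return the largest gradient norm seen and the final one."""
    x = x0.copy()
    max_grad = 0.0
    for t in range(1, T + 1):
        g = stochastic_grad(x)
        max_grad = max(max_grad, np.linalg.norm(g))
        step = eta / np.sqrt(t)
        if normalize:
            # NSGD: update direction is g / ||g||, so the step length is
            # independent of the smoothness constant L.
            x -= step * g / (np.linalg.norm(g) + 1e-12)
        else:
            # Untuned SGD: with eta not tuned to L, early iterates can blow up
            # before eta_t becomes small enough to contract.
            x -= step * g
    return max_grad, np.linalg.norm(L * x)

x0 = np.ones(10)
for name, normalize in [("untuned SGD", False), ("NSGD", True)]:
    max_grad, final_grad = run(x0, normalize)
    print(f"{name:12s}  max ||g|| = {max_grad:.2e}   final ||grad f|| = {final_grad:.2e}")
```

On this toy problem the untuned run typically reports a maximum gradient norm many orders of magnitude larger than the normalized run, which mirrors the "issue with large gradients" that the abstract attributes to untuned SGD; the exact numbers depend on the assumed constants above.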
