Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

7 June 2024
Devyani Maladkar
Ruichen Jiang
Aryan Mokhtari
Abstract

Adaptive gradient methods are arguably the most successful optimization algorithms for neural network training. While it is well-known that adaptive gradient methods can achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In this paper, we aim to close this gap by analyzing the convergence rates of AdaGrad measured by the $\ell_1$-norm of the gradient. Specifically, when the objective has $L$-Lipschitz gradient and the stochastic gradient variance is bounded by $\sigma^2$, we prove a worst-case convergence rate of $\tilde{\mathcal{O}}\left(\frac{\sqrt{d}L}{\sqrt{T}} + \frac{\sqrt{d}\sigma}{T^{1/4}}\right)$, where $d$ is the dimension of the problem. We also present a lower bound of $\Omega\left(\frac{\sqrt{d}}{\sqrt{T}}\right)$ for minimizing the gradient $\ell_1$-norm in the deterministic setting, showing the tightness of our upper bound in the noiseless case. Moreover, under more fine-grained assumptions on the smoothness structure of the objective and the gradient noise and under favorable gradient $\ell_1/\ell_2$ geometry, we show that AdaGrad can potentially shave a factor of $\sqrt{d}$ compared to SGD. To the best of our knowledge, this is the first result for adaptive gradient methods that demonstrates a provable gain over SGD in the non-convex setting.
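
For context, the sketch below illustrates the coordinate-wise AdaGrad update that analyses of this kind study: each coordinate's step size is scaled by the inverse square root of its accumulated squared gradients. The step-size schedule, function names, and the toy noisy quadratic are illustrative assumptions, not the exact variant or setting analyzed in the paper.

```python
import numpy as np

def adagrad(grad_fn, x0, step_size=0.1, eps=1e-8, num_iters=1000):
    """Minimal AdaGrad sketch with per-coordinate adaptive step sizes.

    grad_fn(x) returns a (possibly stochastic) gradient at x.
    """
    x = np.asarray(x0, dtype=float)
    accum = np.zeros_like(x)  # running sum of squared gradients, per coordinate
    for _ in range(num_iters):
        g = grad_fn(x)
        accum += g ** 2  # coordinate-wise accumulation
        x -= step_size * g / (np.sqrt(accum) + eps)  # adaptive per-coordinate step
    return x

# Usage: minimize an ill-conditioned quadratic with noisy gradients (hypothetical example).
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0, 100.0])
noisy_grad = lambda x: A @ x + 0.01 * rng.standard_normal(3)
x_final = adagrad(noisy_grad, x0=np.ones(3))
```

The per-coordinate scaling is what makes the $\ell_1$-norm of the gradient the natural progress measure in the paper's analysis, as opposed to the $\ell_2$-norm typically used for SGD.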
