On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

Abstract

As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well-understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ is the iteration number, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $E\left[\|\nabla f(x)\|_1\right]\geq\sqrt{\frac{2d}{\pi}}\,E\left[\|\nabla f(x)\|_2\right]$ when each element of $\nabla f(x)$ is drawn from the Gaussian distribution $\mathcal N(0,1)$. Empirically, our experiments on real-world deep learning tasks show $\|\nabla f(x)\|_1=\varTheta(\sqrt{d})\|\nabla f(x)\|_2$. Both support the view that our convergence rate is analogous to the optimal $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_2\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD.
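
A short derivation consistent with the Gaussian claim above (a sketch under the stated i.i.d. $\mathcal N(0,1)$ assumption, not reproduced from the paper): writing $g=\nabla f(x)$,

\[
E\left[\|g\|_1\right]=\sum_{i=1}^d E\left[|g_i|\right]=d\sqrt{\tfrac{2}{\pi}},
\qquad
E\left[\|g\|_2\right]\leq\sqrt{E\left[\|g\|_2^2\right]}=\sqrt{d}\quad\text{(Jensen's inequality)},
\]

so $E\left[\|g\|_1\right]=\sqrt{\tfrac{2d}{\pi}}\cdot\sqrt{d}\geq\sqrt{\tfrac{2d}{\pi}}\,E\left[\|g\|_2\right]$, which matches the $\sqrt{2d/\pi}$ factor in the abstract and explains the extra $\sqrt{d}$ appearing in the $\ell_1$-norm rate relative to the $\ell_2$-norm rate of SGD.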

@article{li2025_2505.11840,
  title={On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm},
  author={Huan Li and Yiming Dong and Zhouchen Lin},
  journal={arXiv preprint arXiv:2505.11840},
  year={2025}
}