
On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

Main: 19 pages, 5 figures; bibliography: 4 pages
Abstract

As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not well understood theoretically. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_1\right] \leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ is the number of iterations, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(x)\|_2 \ll \|\nabla f(x)\|_1 \leq \sqrt{d}\,\|\nabla f(x)\|_2$ for any high-dimensional vector $x$, and $E\left[\|\nabla f(x)\|_1\right] \geq \sqrt{\frac{2d}{\pi}}\,E\left[\|\nabla f(x)\|_2\right]$ when each element of $\nabla f(x)$ is drawn from the Gaussian distribution $\mathcal{N}(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $\|\nabla f(x)\|_1 = \varTheta(\sqrt{d})\,\|\nabla f(x)\|_2$. Both observations support that our convergence rate is analogous to the optimal convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_2\right] \leq O(\frac{C}{K^{1/4}})$ of SGD in the ideal case. We also extend our result to NAdamW, an AdamW variant that employs a double-momentum mechanism, and show that it maintains the same convergence rate.
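As a quick illustration of the norm-ratio claim, the following minimal sketch (assuming NumPy; not from the paper's experiments) draws a Gaussian vector $g \sim \mathcal{N}(0, I_d)$ and compares $\|g\|_1/\|g\|_2$ against the predicted $\sqrt{2d/\pi}$ scaling:

```python
import numpy as np

# For g ~ N(0, I_d): E[|g_i|] = sqrt(2/pi), so E[||g||_1] = d*sqrt(2/pi),
# while ||g||_2 concentrates around sqrt(d). The ratio ||g||_1/||g||_2
# should therefore grow like sqrt(2d/pi), as the abstract's bound states.
rng = np.random.default_rng(0)
for d in [10**2, 10**4, 10**6]:
    g = rng.standard_normal(d)
    ratio = np.linalg.norm(g, 1) / np.linalg.norm(g, 2)
    print(f"d={d:>8}  ||g||_1/||g||_2 = {ratio:9.2f}  "
          f"sqrt(2d/pi) = {np.sqrt(2 * d / np.pi):9.2f}")
```

As $d$ grows, the sampled ratio tracks $\sqrt{2d/\pi}$ closely, consistent with the $\|\nabla f(x)\|_1 = \varTheta(\sqrt{d})\,\|\nabla f(x)\|_2$ behavior reported for real deep learning gradients.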
