On the Convergence Rate of AdamW Measured by $\ell_1$ Norm
As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well-understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\|\nabla f(x^k)\|_1\right] \le O\!\left(\frac{\sqrt{d}\,C}{K^{1/4}}\right)$ for AdamW measured by the $\ell_1$ norm, where $K$ represents the iteration number, $d$ denotes the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(x)\|_2 \ll \|\nabla f(x)\|_1 \le \sqrt{d}\,\|\nabla f(x)\|_2$ for any high-dimensional vector $\nabla f(x)$, and $\mathbb{E}\left[\|\nabla f(x)\|_1\right] = \sqrt{2d/\pi}\,\|\nabla f(x)\|_2$ when each element of $\nabla f(x)$ is generated from the Gaussian distribution $\mathcal{N}(0,\sigma^2)$. Empirically, our experimental results on real-world deep learning tasks reveal $\|\nabla f(x)\|_1 = \Theta(\sqrt{d})\,\|\nabla f(x)\|_2$. Both support the conclusion that our convergence rate is analogous to the optimal $\frac{C}{K^{1/4}}$ convergence rate of SGD in the ideal case. We also extend our result to NAdamW, a variant of AdamW that employs a double-momentum mechanism, and demonstrate that it maintains the same convergence rate.
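The Gaussian norm relation above is easy to check numerically. The following sketch (not from the paper's code; the dimension and seed are illustrative assumptions) draws a vector with i.i.d. standard Gaussian entries and compares its $\ell_1/\ell_2$ norm ratio against the predicted $\sqrt{2d/\pi}$ factor:

```python
# Sketch: verify that for x with i.i.d. N(0, 1) entries,
# ||x||_1 / ||x||_2 concentrates around sqrt(2d/pi),
# i.e. the l1 norm is larger by a factor of order sqrt(d).
import math
import random

random.seed(0)
d = 100_000  # illustrative dimension (assumption, not from the paper)
x = [random.gauss(0.0, 1.0) for _ in range(d)]

l1 = sum(abs(v) for v in x)                 # ||x||_1
l2 = math.sqrt(sum(v * v for v in x))       # ||x||_2

ratio = l1 / l2
predicted = math.sqrt(2 * d / math.pi)
print(f"l1/l2 = {ratio:.2f}, sqrt(2d/pi) = {predicted:.2f}")
```

By the law of large numbers the relative deviation of the ratio from $\sqrt{2d/\pi}$ shrinks like $1/\sqrt{d}$, so at this dimension the two numbers agree to within a fraction of a percent.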