
Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity

International Conference on Learning Representations (ICLR), 2024
Main: 13 pages · Appendix: 15 pages · Bibliography: 3 pages
5 figures · 5 tables
Abstract

Adam outperforms SGD when training language models, yet this advantage is not well understood theoretically: previous convergence analyses for Adam and SGD mainly focus on the number of steps $T$ and are already minimax-optimal in the non-convex case, both achieving $\widetilde{O}(T^{-1/4})$ rates. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under novel assumptions that the loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields much better empirical smoothness constants for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed, while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.
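For context, here is a hedged sketch of the kind of norm-dependent smoothness condition the abstract contrasts, stated in the standard way via the dual norm on gradient differences; the paper's exact assumption (e.g., a coordinate-wise or Hessian-based variant) may differ in its details:

\[
\|\nabla\mathcal{L}(x)-\nabla\mathcal{L}(y)\|_2 \le L_2\,\|x-y\|_2 \quad (\ell_2\text{-smoothness}),
\qquad
\|\nabla\mathcal{L}(x)-\nabla\mathcal{L}(y)\|_1 \le L_\infty\,\|x-y\|_\infty \quad (\ell_\infty\text{-smoothness}),
\]

where $\|\cdot\|_1$ is the dual norm of $\|\cdot\|_\infty$. Intuitively, under an $\ell_\infty$-type condition a coordinate-wise, sign-like update of the kind Adam approximates is the natural analogue of steepest descent, whereas SGD's update is tied to the $\ell_2$ norm.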
