Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity
Adam outperforms SGD when training language models. Yet this advantage is not well-understood theoretically -- previous convergence analyses for Adam and SGD mainly focus on the number of steps $T$ and are already minimax-optimal in non-convex cases, with both rates being $\widetilde{O}(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under a novel assumption that the loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed, while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.
View on arXiv
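As a rough illustration of the coordinate-wise mechanism the abstract refers to (a minimal sketch of ours, not the paper's code; the toy quadratic, the random rotation, and the hyperparameters are assumptions), the NumPy snippet below runs Adam on an axis-aligned ill-conditioned quadratic and on the same quadratic rotated by a random orthogonal matrix. SGD's update depends only on the gradient vector, so its trajectory is equivariant under such rotations, whereas Adam's per-coordinate rescaling is tied to the coordinate system, i.e. to the loss's $\ell_\infty$-geometry.

```python
import numpy as np

# Sketch (not the paper's code): Adam rescales each coordinate by its own
# second-moment estimate, so its behavior depends on the coordinate system,
# unlike SGD, whose update depends only on the gradient vector.

def adam_step(x, g, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    # Coordinate-wise step: each entry of x gets its own effective learning rate.
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def run_adam(H, x0, steps=200):
    x, m, v = x0.copy(), np.zeros_like(x0), np.zeros_like(x0)
    for t in range(1, steps + 1):
        g = H @ x                        # gradient of the quadratic 0.5 * x^T H x
        x, m, v = adam_step(x, g, m, v, t)
    return np.linalg.norm(H @ x)         # final gradient norm

rng = np.random.default_rng(0)
d = 8
H = np.diag(rng.uniform(0.1, 10.0, d))       # axis-aligned, ill-conditioned loss
R, _ = np.linalg.qr(rng.normal(size=(d, d))) # random orthogonal rotation
H_rot = R.T @ H @ R                          # same spectrum, rotated coordinates
x0 = np.ones(d)

print("Adam, axis-aligned loss:", run_adam(H, x0))
print("Adam, rotated loss:     ", run_adam(H_rot, R.T @ x0))
# SGD started from x0 and from R.T @ x0 would reach identical gradient norms on
# the two losses (rotation equivariance); Adam's trajectories generally differ.
```

This is only a toy analogue of the paper's rotation experiments on real models, but it shows the basic asymmetry: rotating the loss leaves SGD's behavior unchanged up to the same rotation, while it can change which geometry Adam's coordinate-wise adaptivity exploits.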