
Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tailed Class Imbalance

Main: 6 Pages
4 Figures
Bibliography: 2 Pages
Appendix: 7 Pages
Abstract

Adaptive optimization methods (such as Adam) play a major role in LLM pretraining, significantly outperforming Gradient Descent (GD). Recent studies have proposed new smoothness assumptions on the loss function to explain the advantages over GD of adaptive algorithms with structured preconditioners (e.g., coordinate-wise or layer-wise) and of steepest descent methods w.r.t. non-Euclidean norms (e.g., the $\ell_\infty$ norm or the spectral norm). However, it remains unclear how these smoothness assumptions manifest in language modelling tasks. In this work, we aim to analyze the benefit of $\ell_\infty$-norm descent (a.k.a. sign descent) directly from properties of the data distribution, namely, heavy-tailed class imbalance. We propose a minimal yet representative setting of next-token prediction, where we can provably show faster convergence of coordinate-wise algorithms such as sign descent (steepest descent w.r.t. the $\ell_\infty$ norm) over normalized GD (steepest descent w.r.t. the $\ell_2$ norm) in the presence of heavy-tailed class imbalance.
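To make the two update rules named in the abstract concrete, here is a minimal NumPy sketch of a single step of each. It only illustrates the standard definitions of sign descent and normalized GD; it is not the paper's experimental setup. The function names, the step size `lr`, the stabilizer `eps`, and the example gradient are all assumptions chosen for illustration.

```python
import numpy as np

def sign_descent_step(w, grad, lr):
    """Steepest descent w.r.t. the l_inf norm: every coordinate moves by
    exactly lr against the sign of its gradient entry (hypothetical helper)."""
    return w - lr * np.sign(grad)

def normalized_gd_step(w, grad, lr, eps=1e-12):
    """Steepest descent w.r.t. the l_2 norm: move a distance of about lr
    along the negative gradient direction (hypothetical helper).
    eps is an assumed guard against division by zero."""
    return w - lr * grad / (np.linalg.norm(grad) + eps)

# Illustrative gradient with entries of different magnitude (made-up values).
w = np.zeros(3)
grad = np.array([3.0, -4.0, 0.0])

w_sign = sign_descent_step(w, grad, lr=0.1)
w_ngd = normalized_gd_step(w, grad, lr=0.1)

print("sign descent  :", w_sign)
print("normalized GD :", w_ngd)
```

In this sketch, sign descent moves every coordinate with a nonzero gradient by the same amount regardless of its magnitude, while normalized GD splits a fixed step length across coordinates in proportion to their gradient entries. This is the coordinate-wise versus Euclidean distinction that the abstract contrasts.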
