Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on
Transformers, but Sign Descent Might Be

27 April 2023

Frederik Kunstner

Mark W. Schmidt

Papers citing "Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be"

Title | Authors | Tag | Counts (as listed) | Date
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism | Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar | - | 55 0 0 | 30 Dec 2024
Deconstructing What Makes a Good Optimizer for Language Models | Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade | - | 39 17 0 | 10 Jul 2024
Does SGD really happen in tiny subspaces? | Minhak Song, Kwangjun Ahn, Chulhee Yun | - | 44 4 1 | 25 May 2024
Dynamic Anisotropic Smoothing for Noisy Derivative-Free Optimization | S. Reifenstein, T. Leleu, Yoshihisa Yamamoto | - | 32 1 0 | 02 May 2024
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models | Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti | - | 18 26 0 | 29 Feb 2024
Implicit Bias and Fast Convergence Rates for Self-attention | Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis | - | 19 13 0 | 08 Feb 2024
On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions | Yusu Hong, Junhong Lin | - | 38 10 0 | 06 Feb 2024
Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts | Lizhang Chen, Bo Liu, Kaizhao Liang, Qian Liu | ODL | 11 15 0 | 09 Oct 2023
Stochastic Training is Not Necessary for Generalization | Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein | - | 79 72 0 | 29 Sep 2021
A new regret analysis for Adam-type algorithms | Ahmet Alacaoglu, Yura Malitsky, P. Mertikopoulos, V. Cevher | ODL | 26 41 0 | 21 Mar 2020
A Simple Convergence Proof of Adam and Adagrad | Alexandre Défossez, Léon Bottou, Francis R. Bach, Nicolas Usunier | - | 56 143 0 | 05 Mar 2020