Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients
Lukas Balles, Philipp Hennig
22 May 2017 · arXiv: 1705.07774
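The paper's central observation, echoed by several of the sign-based entries below (e.g. signSGD, Lion), is that Adam's element-wise rescaling makes its update direction behave like the sign of the stochastic gradient, with a variance-dependent magnitude. A minimal NumPy sketch of that decomposition (illustrative only; the one-sample moment estimates are crude stand-ins, not the paper's exact derivation):

```python
# Minimal sketch (illustrative; NOT the paper's exact algorithm) of the
# decomposition "Dissecting Adam" studies: Adam's element-wise step behaves
# like sign(gradient) scaled by a variance-dependent magnitude.
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(loc=0.1, scale=1.0, size=5)  # one noisy stochastic gradient

lr = 0.01
sgd_step  = -lr * g            # SGD uses the gradient's sign AND magnitude
sign_step = -lr * np.sign(g)   # signSGD keeps only the sign

# Adam-like step m / sqrt(v): with these crude one-sample moment estimates,
# m / sqrt(v) = g / |g| = sign(g), the limiting case the paper analyzes.
m, v = g, g**2
adam_like = -lr * m / np.sqrt(v + 1e-8)

print(sgd_step)
print(sign_step)
print(adam_like)
```

In full Adam, m and v are exponential moving averages, so |m|/sqrt(v) shrinks toward zero where the gradient estimate is noisy; the paper reads this as variance-based damping of a sign update.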
Papers citing "Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients" (30 of 30 papers shown):
1. Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
   Akiyoshi Tomihari, Issei Sato · ODL · 31 Jan 2025 · (61 / 0 / 0)
2. Distributed Sign Momentum with Local Steps for Training Transformers
   Shuhua Yu, Ding Zhou, Cong Xie, An Xu, Zhi-Li Zhang, Xin Liu, S. Kar · 26 Nov 2024 · (66 / 0 / 0)
3. Deconstructing What Makes a Good Optimizer for Language Models
   Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade · 10 Jul 2024 · (44 / 17 / 0)
4. Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization
   Shuo Xie, Zhiyuan Li · OffRL · 05 Apr 2024 · (39 / 12 / 0)
5. SignSGD with Federated Voting
   Chanho Park, H. Vincent Poor, Namyoon Lee · FedML · 25 Mar 2024 · (40 / 1 / 0)
6. Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
   Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti · 29 Feb 2024 · (36 / 26 / 0)
7. Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization
   Zhen Qin, Zhishuai Liu, Pan Xu · 24 Oct 2023 · (18 / 1 / 0)
8. Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts
   Lizhang Chen, Bo Liu, Kaizhao Liang, Qian Liu · ODL · 09 Oct 2023 · (19 / 15 / 0)
9. Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods
   A. Ma, Yangchen Pan, Amir-massoud Farahmand · AAML · 13 Aug 2023 · (25 / 5 / 0)
10. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
    Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma · VLM · 23 May 2023 · (32 / 128 / 0)
11. Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods
    Junchi Yang, Xiang Li, Ilyas Fatkhullin, Niao He · 21 May 2023 · (34 / 15 / 0)
12. Symbolic Discovery of Optimization Algorithms
    Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, ..., Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le · 13 Feb 2023 · (55 / 350 / 0)
13. A Deep Learning Approach to Generating Photospheric Vector Magnetograms of Solar Active Regions for SOHO/MDI Using SDO/HMI and BBSO Data
    Haodi Jiang, Qin Li, Zhihang Hu, Nian Liu, Yasser Abduallah, ..., Genwei Zhang, Yan Xu, Wynne Hsu, J. T. Wang, Haimin Wang · 04 Nov 2022 · (32 / 6 / 0)
14. An Empirical Evaluation of Zeroth-Order Optimization Methods on AI-driven Molecule Optimization
    Elvin Lo, Pin-Yu Chen · 27 Oct 2022 · (26 / 0 / 0)
15. Momentum Diminishes the Effect of Spectral Bias in Physics-Informed Neural Networks
    G. Farhani, Alexander Kazachek, Boyu Wang · 29 Jun 2022 · (19 / 6 / 0)
16. Logit Normalization for Long-tail Object Detection
    Liang Zhao, Yao Teng, Limin Wang · 31 Mar 2022 · (26 / 10 / 0)
17. A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range
    Guoqiang Zhang, Kenta Niwa, W. Kleijn · ODL · 24 Mar 2022 · (13 / 2 / 0)
18. Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam
    Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He · OffRL, AI4CE · 12 Feb 2022 · (24 / 20 / 0)
19. Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
    Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu · MLT, AI4CE · 25 Aug 2021 · (44 / 38 / 0)
20. GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training
    Chen Zhu, Renkun Ni, Zheng Xu, Kezhi Kong, W. R. Huang, Tom Goldstein · ODL · 16 Feb 2021 · (41 / 53 / 0)
21. A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms
    Chao Ma, Lei Wu, E. Weinan · ODL · 14 Sep 2020 · (11 / 23 / 0)
22. AdaScale SGD: A User-Friendly Algorithm for Distributed Training
    Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin · ODL · 09 Jul 2020 · (21 / 37 / 0)
23. An Analysis of Constant Step Size SGD in the Non-convex Regime: Asymptotic Normality and Bias
    Lu Yu, Krishnakumar Balasubramanian, S. Volgushev, Murat A. Erdogdu · 14 Jun 2020 · (32 / 50 / 0)
24. LaProp: Separating Momentum and Adaptivity in Adam
    Liu Ziyin, Zhikang T. Wang, Masahito Ueda · ODL · 12 Feb 2020 · (6 / 18 / 0)
25. Limitations of the Empirical Fisher Approximation for Natural Gradient Descent
    Frederik Kunstner, Lukas Balles, Philipp Hennig · 29 May 2019 · (21 / 207 / 0)
26. A Sufficient Condition for Convergences of Adam and RMSProp
    Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu · 23 Nov 2018 · (19 / 362 / 0)
27. signSGD with Majority Vote is Communication Efficient And Fault Tolerant
    Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Anima Anandkumar · FedML · 11 Oct 2018 · (23 / 46 / 0)
28. signSGD: Compressed Optimisation for Non-Convex Problems
    Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, Anima Anandkumar · FedML, ODL · 13 Feb 2018 · (35 / 1,019 / 0)
29. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
    N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang · ODL · 15 Sep 2016 · (281 / 2,889 / 0)
30. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition
    Hamed Karimi, J. Nutini, Mark W. Schmidt · 16 Aug 2016 · (139 / 1,199 / 0)