Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients
Lukas Balles, Philipp Hennig
22 May 2017 · arXiv: 1705.07774
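The paper's central observation, echoed by several of the sign-based entries below (e.g. signSGD, Lion), is that Adam's element-wise rescaling makes its update direction behave like the sign of the stochastic gradient, with a variance-dependent magnitude. A minimal NumPy sketch of that decomposition (illustrative only; the one-sample moment estimates are crude stand-ins, not the paper's exact derivation):

```python
# Minimal sketch (illustrative; NOT the paper's exact algorithm) of the
# decomposition "Dissecting Adam" studies: Adam's element-wise step behaves
# like sign(gradient) scaled by a variance-dependent magnitude.
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(loc=0.1, scale=1.0, size=5)  # one noisy stochastic gradient

lr = 0.01
sgd_step  = -lr * g            # SGD uses the gradient's sign AND magnitude
sign_step = -lr * np.sign(g)   # signSGD keeps only the sign

# Adam-like step m / sqrt(v): with these crude one-sample moment estimates,
# m / sqrt(v) = g / |g| = sign(g), the limiting case the paper analyzes.
m, v = g, g**2
adam_like = -lr * m / np.sqrt(v + 1e-8)

print(sgd_step)
print(sign_step)
print(adam_like)
```

In full Adam, m and v are exponential moving averages, so |m|/sqrt(v) shrinks toward zero where the gradient estimate is noisy; the paper reads this as variance-based damping of a sign update.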
Papers citing "Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients" (30 of 30 papers shown):
1. Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
   Akiyoshi Tomihari, Issei Sato · ODL · 31 Jan 2025 · (61 / 0 / 0)
2. Distributed Sign Momentum with Local Steps for Training Transformers
   Shuhua Yu, Ding Zhou, Cong Xie, An Xu, Zhi-Li Zhang, Xin Liu, S. Kar · 26 Nov 2024 · (66 / 0 / 0)
3. Deconstructing What Makes a Good Optimizer for Language Models
   Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade · 10 Jul 2024 · (44 / 17 / 0)
4. Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization
   Shuo Xie, Zhiyuan Li · OffRL · 05 Apr 2024 · (39 / 12 / 0)
5. SignSGD with Federated Voting
   Chanho Park, H. Vincent Poor, Namyoon Lee · FedML · 25 Mar 2024 · (40 / 1 / 0)
6. Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
   Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti · 29 Feb 2024 · (36 / 26 / 0)
7. Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization
   Zhen Qin, Zhishuai Liu, Pan Xu · 24 Oct 2023 · (18 / 1 / 0)
8. Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts
   Lizhang Chen, Bo Liu, Kaizhao Liang, Qian Liu · ODL · 09 Oct 2023 · (19 / 15 / 0)
9. Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods
   A. Ma, Yangchen Pan, Amir-massoud Farahmand · AAML · 13 Aug 2023 · (25 / 5 / 0)
10. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
    Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma · VLM · 23 May 2023 · (32 / 128 / 0)
11. Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods
    Junchi Yang, Xiang Li, Ilyas Fatkhullin, Niao He · 21 May 2023 · (34 / 15 / 0)
12. Symbolic Discovery of Optimization Algorithms
    Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, ..., Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le · 13 Feb 2023 · (55 / 350 / 0)
13. A Deep Learning Approach to Generating Photospheric Vector Magnetograms of Solar Active Regions for SOHO/MDI Using SDO/HMI and BBSO Data
    Haodi Jiang, Qin Li, Zhihang Hu, Nian Liu, Yasser Abduallah, ..., Genwei Zhang, Yan Xu, Wynne Hsu, J. T. Wang, Haimin Wang · 04 Nov 2022 · (32 / 6 / 0)
14. An Empirical Evaluation of Zeroth-Order Optimization Methods on AI-driven Molecule Optimization
    Elvin Lo, Pin-Yu Chen · 27 Oct 2022 · (26 / 0 / 0)
15. Momentum Diminishes the Effect of Spectral Bias in Physics-Informed Neural Networks
    G. Farhani, Alexander Kazachek, Boyu Wang · 29 Jun 2022 · (19 / 6 / 0)
16. Logit Normalization for Long-tail Object Detection
    Liang Zhao, Yao Teng, Limin Wang · 31 Mar 2022 · (26 / 10 / 0)
17. A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range
    Guoqiang Zhang, Kenta Niwa, W. Kleijn · ODL · 24 Mar 2022 · (13 / 2 / 0)
18. Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam
    Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He · OffRL, AI4CE · 12 Feb 2022 · (24 / 20 / 0)
19. Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
    Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu · MLT, AI4CE · 25 Aug 2021 · (44 / 38 / 0)
20. GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training
    Chen Zhu, Renkun Ni, Zheng Xu, Kezhi Kong, W. R. Huang, Tom Goldstein · ODL · 16 Feb 2021 · (41 / 53 / 0)
21. A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms
    Chao Ma, Lei Wu, E. Weinan · ODL · 14 Sep 2020 · (11 / 23 / 0)
22. AdaScale SGD: A User-Friendly Algorithm for Distributed Training
    Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin · ODL · 09 Jul 2020 · (21 / 37 / 0)
23. An Analysis of Constant Step Size SGD in the Non-convex Regime: Asymptotic Normality and Bias
    Lu Yu, Krishnakumar Balasubramanian, S. Volgushev, Murat A. Erdogdu · 14 Jun 2020 · (32 / 50 / 0)
24. LaProp: Separating Momentum and Adaptivity in Adam
    Liu Ziyin, Zhikang T. Wang, Masahito Ueda · ODL · 12 Feb 2020 · (6 / 18 / 0)
25. Limitations of the Empirical Fisher Approximation for Natural Gradient Descent
    Frederik Kunstner, Lukas Balles, Philipp Hennig · 29 May 2019 · (21 / 207 / 0)
26. A Sufficient Condition for Convergences of Adam and RMSProp
    Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu · 23 Nov 2018 · (19 / 362 / 0)
27. signSGD with Majority Vote is Communication Efficient And Fault Tolerant
    Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Anima Anandkumar · FedML · 11 Oct 2018 · (23 / 46 / 0)
28. signSGD: Compressed Optimisation for Non-Convex Problems
    Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, Anima Anandkumar · FedML, ODL · 13 Feb 2018 · (35 / 1,019 / 0)
29. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
    N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang · ODL · 15 Sep 2016 · (281 / 2,889 / 0)
30. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition
    Hamed Karimi, J. Nutini, Mark W. Schmidt · 16 Aug 2016 · (139 / 1,199 / 0)