Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective
arXiv: 2402.03496 · 5 February 2024
Authors: Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani
Community: ODL
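For context on the question in the title: a standard diagonal adaptive method such as Adam divides each update by the square root of a second-moment estimate. The sketch below contrasts that familiar update with a square-root-free variant. It is a minimal illustration of the general idea only, with arbitrary hyperparameter defaults, and is not the algorithm proposed in the paper.

import numpy as np

def adaptive_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, use_sqrt=True):
    # Exponential moving averages of the gradient and its elementwise square.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias-corrected estimates (t is the 1-based step count).
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # use_sqrt=True: the usual Adam-style preconditioner sqrt(v_hat) + eps.
    # use_sqrt=False: a square-root-free variant, shown only to illustrate
    # the contrast raised by the paper's title.
    denom = np.sqrt(v_hat) + eps if use_sqrt else v_hat + eps
    return theta - lr * m_hat / denom, m, v

# Hypothetical usage on a toy quadratic loss 0.5 * ||theta||^2 (gradient = theta).
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 101):
    theta, m, v = adaptive_step(theta, theta, m, v, t, use_sqrt=False)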
Papers citing "Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective" (8 of 8 papers shown)
COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs
  Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, T. Zhao
  24 Feb 2025

Spectral-factorized Positive-definite Curvature Learning for NN Training
  Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Roger B. Grosse
  10 Feb 2025

Position: Curvature Matrices Should Be Democratized via Linear Operators
  Felix Dangel, Runa Eschenhagen, Weronika Ormaniec, Andres Fernandez, Lukas Tatzel, Agustinus Kristiadi
  31 Jan 2025

SOAP: Improving and Stabilizing Shampoo using Adam
  Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade
  17 Sep 2024

AdaFisher: Adaptive Second Order Optimization via Fisher Information
  Damien Martins Gomes, Yanlei Zhang, Eugene Belilovsky, Guy Wolf, Mahdi S. Hosseini
  26 May 2024 · ODL

Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
  Frederik Kunstner, Jacques Chen, J. Lavington, Mark W. Schmidt
  27 Apr 2023

Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam
  Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Y. Gal, Akash Srivastava
  13 Jun 2018 · ODL

Variational Optimization
  J. Staines, David Barber
  18 Dec 2012 · DRL