Deconstructing What Makes a Good Optimizer for Language Models
arXiv:2407.07972 · 10 July 2024
Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade

Papers citing "Deconstructing What Makes a Good Optimizer for Language Models" (18 papers)

A multilevel approach to accelerate the training of Transformers
Guillaume Lauga, Maël Chaumette, Edgar Desainte-Maréville, Étienne Lasalle, Arthur Lebeurrier · 24 Apr 2025

When Can You Get Away with Low Memory Adam?
Dayal Singh Kalra, John Kirchenbauer, M. Barkeshli, Tom Goldstein · 03 Mar 2025

Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
Tianjin Huang, Haotian Hu, Zhenyu (Allen) Zhang, Gaojie Jin, X. Li, ..., Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu · 24 Feb 2025

COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs
Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, T. Zhao · 24 Feb 2025

Gradient Multi-Normalization for Stateless and Scalable LLM Training
M. Scetbon, Chao Ma, Wenbo Gong, Edward Meeds · 10 Feb 2025

Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
Akiyoshi Tomihari, Issei Sato · 31 Jan 2025

FOCUS: First Order Concentrated Updating Scheme
Yizhou Liu, Ziming Liu, Jeff Gore · 21 Jan 2025

Loss-to-Loss Prediction: Scaling Laws for All Datasets
David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade · 19 Nov 2024

FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training
Philip Zmushko, Aleksandr Beznosikov, Martin Takáč, Samuel Horváth · 12 Nov 2024

Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?
Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang · 02 Oct 2024

Old Optimizer, New Norm: An Anthology
Jeremy Bernstein, Laker Newhouse · 30 Sep 2024

How Feature Learning Can Improve Neural Scaling Laws
Blake Bordelon, Alexander B. Atanasov, C. Pehlevan · 26 Sep 2024

SOAP: Improving and Stabilizing Shampoo using Adam
Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade · 17 Sep 2024

The AdEMAMix Optimizer: Better, Faster, Older
Matteo Pagliardini, Pierre Ablin, David Grangier · 05 Sep 2024

Adam-mini: Use Fewer Learning Rates To Gain More
Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun · 24 Jun 2024

OLMo: Accelerating the Science of Language Models
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Michael Kinney, ..., Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hanna Hajishirzi · 01 Feb 2024

Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner, Jacques Chen, J. Lavington, Mark W. Schmidt · 27 Apr 2023

Learning by Turning: Neural Architecture Aware Optimisation
Yang Liu, Jeremy Bernstein, M. Meister, Yisong Yue · 14 Feb 2021