Does SGD really happen in tiny subspaces?
Minhak Song, Kwangjun Ahn, Chulhee Yun
arXiv 2405.16002 · 25 May 2024

Cited By
Papers citing "Does SGD really happen in tiny subspaces?" (11 of 11 papers shown)

1. Dion: A Communication-Efficient Optimizer for Large Models
   Kwangjun Ahn, Byron Xu
   07 Apr 2025

2. The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
   Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Weinan E, Lei Wu
   26 Feb 2025

3. Preconditioned Subspace Langevin Monte Carlo
   Tyler Maunu, Jiayi Yao
   18 Dec 2024

4. Understanding Gradient Descent through the Training Jacobian
   Nora Belrose, Adam Scherlis
   09 Dec 2024

5. Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise
   Kwangjun Ahn, Zhiyu Zhang, Yunbum Kook, Yan Dai
   02 Feb 2024

6. Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
   Frederik Kunstner, Jacques Chen, J. Lavington, Mark W. Schmidt
   27 Apr 2023

7. Understanding Edge-of-Stability Training Dynamics with a Minimalist Example
   Xingyu Zhu, Zixuan Wang, Xiang Wang, Mo Zhou, Rong Ge
   07 Oct 2022

8. The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima
   Peter L. Bartlett, Philip M. Long, Olivier Bousquet
   04 Oct 2022

9. Understanding Gradient Descent on Edge of Stability in Deep Learning
   Sanjeev Arora, Zhiyuan Li, A. Panigrahi [MLT]
   19 May 2022

10. What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
    Zhiyuan Li, Tianhao Wang, Sanjeev Arora [MLT]
    13 Oct 2021

11. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
    N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang [ODL]
    15 Sep 2016