Does SGD really happen in tiny subspaces?
Minhak Song, Kwangjun Ahn, Chulhee Yun
arXiv 2405.16002 · 25 May 2024

Cited By
Papers citing "Does SGD really happen in tiny subspaces?" (11 of 11 papers shown)

1. Dion: A Communication-Efficient Optimizer for Large Models
   Kwangjun Ahn, Byron Xu
   07 Apr 2025

2. The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
   Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Weinan E, Lei Wu
   26 Feb 2025

3. Preconditioned Subspace Langevin Monte Carlo
   Tyler Maunu, Jiayi Yao
   18 Dec 2024

4. Understanding Gradient Descent through the Training Jacobian
   Nora Belrose, Adam Scherlis
   09 Dec 2024

5. Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise
   Kwangjun Ahn, Zhiyu Zhang, Yunbum Kook, Yan Dai
   02 Feb 2024

6. Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
   Frederik Kunstner, Jacques Chen, J. Lavington, Mark W. Schmidt
   27 Apr 2023

7. Understanding Edge-of-Stability Training Dynamics with a Minimalist Example
   Xingyu Zhu, Zixuan Wang, Xiang Wang, Mo Zhou, Rong Ge
   07 Oct 2022

8. The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima
   Peter L. Bartlett, Philip M. Long, Olivier Bousquet
   04 Oct 2022

9. Understanding Gradient Descent on Edge of Stability in Deep Learning
   Sanjeev Arora, Zhiyuan Li, A. Panigrahi [MLT]
   19 May 2022

10. What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
    Zhiyuan Li, Tianhao Wang, Sanjeev Arora [MLT]
    13 Oct 2021

11. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
    N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang [ODL]
    15 Sep 2016