ResearchTrend.AI
The Marginal Value of Momentum for Small Learning Rate SGD


27 July 2023
Runzhe Wang
Sadhika Malladi
Tianhao Wang
Kaifeng Lyu
Zhiyuan Li
    ODL

Papers citing "The Marginal Value of Momentum for Small Learning Rate SGD"

14 / 14 papers shown
On the Performance Analysis of Momentum Method: A Frequency Domain Perspective
Xianliang Li
Jun Luo
Zhiwei Zheng
Hanxiao Wang
Li Luo
Lingkun Wen
Linlong Wu
Sheng Xu
72
0
0
29 Nov 2024
Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training
Atli Kosson
Bettina Messmer
Martin Jaggi
AI4CE
18
2
0
31 Oct 2024
Does SGD really happen in tiny subspaces?
Minhak Song
Kwangjun Ahn
Chulhee Yun
53
4
1
25 May 2024
(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum
Anh Dang
Reza Babanezhad
Sharan Vaswani
14
0
0
12 Jan 2024
Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise
Rui Pan
Yuxing Liu
Xiaoyu Wang
Tong Zhang
13
5
0
22 Dec 2023
A Quadratic Synchronization Rule for Distributed Deep Learning
Xinran Gu
Kaifeng Lyu
Sanjeev Arora
Jingzhao Zhang
Longbo Huang
28
1
0
22 Oct 2023
Flatter, faster: scaling momentum for optimal speedup of SGD
Aditya Cowsik
T. Can
Paolo Glorioso
47
5
0
28 Oct 2022
A Kernel-Based View of Language Model Fine-Tuning
Sadhika Malladi
Alexander Wettig
Dingli Yu
Danqi Chen
Sanjeev Arora
VLM
66
60
0
11 Oct 2022
On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Sadhika Malladi
Kaifeng Lyu
A. Panigrahi
Sanjeev Arora
88
40
0
20 May 2022
Understanding Gradient Descent on Edge of Stability in Deep Learning
Sanjeev Arora
Zhiyuan Li
A. Panigrahi
MLT
72
88
0
19 May 2022
What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
Zhiyuan Li
Tianhao Wang
Sanjeev Arora
MLT
83
98
0
13 Oct 2021
Making Pre-trained Language Models Better Few-shot Learners
Tianyu Gao
Adam Fisch
Danqi Chen
241
1,898
0
31 Dec 2020
Quasi-hyperbolic momentum and Adam for deep learning
Jerry Ma
Denis Yarats
ODL
73
126
0
16 Oct 2018
A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay
L. Smith
191
1,007
0
26 Mar 2018