Toward Understanding Why Adam Converges Faster Than SGD for Transformers

31 May 2023
Yan Pan
Yuanzhi Li
arXiv: 2306.00204

Papers citing "Toward Understanding Why Adam Converges Faster Than SGD for Transformers"

10 / 10 papers shown
Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
Akiyoshi Tomihari
Issei Sato
ODL
61
1
0
31 Jan 2025
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
Tim Tsz-Kit Lau
Weijian Li
Chenwei Xu
Han Liu
Mladen Kolar
147
0
0
30 Dec 2024
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
Weronika Ormaniec
Felix Dangel
Sidak Pal Singh
35
6
0
14 Oct 2024
Deconstructing What Makes a Good Optimizer for Language Models
Rosie Zhao
Depen Morwani
David Brandfonbrener
Nikhil Vyas
Sham Kakade
50
17
0
10 Jul 2024
Directional Smoothness and Gradient Methods: Convergence and Adaptivity
Aaron Mishkin
Ahmed Khaled
Yuanhao Wang
Aaron Defazio
Robert Mansel Gower
44
6
0
06 Mar 2024
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Frederik Kunstner
Robin Yadav
Alan Milligan
Mark Schmidt
Alberto Bietti
39
26
0
29 Feb 2024
Deepfake Detection and the Impact of Limited Computing Capabilities
Paloma Cantero-Arjona
Alfonso Sánchez-Macián
33
2
0
08 Feb 2024
High-Dimensional Private Empirical Risk Minimization by Greedy Coordinate Descent
Paul Mangold
A. Bellet
Joseph Salmon
Marc Tommasi
42
5
0
04 Jul 2022
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
261
1,996
0
31 Dec 2020
A Simple Convergence Proof of Adam and Adagrad
Alexandre Défossez
Léon Bottou
Francis R. Bach
Nicolas Usunier
56
143
0
05 Mar 2020