
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

29 February 2024
Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti

Papers citing "Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models"

19 papers shown

Towards Quantifying the Hessian Structure of Neural Networks
Zhaorui Dong, Yushun Zhang, Z. Luo, Jianfeng Yao, Ruoyu Sun
05 May 2025

When Can You Get Away with Low Memory Adam?
Dayal Singh Kalra, John Kirchenbauer, M. Barkeshli, Tom Goldstein
03 Mar 2025

The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Weinan E, Lei Wu
26 Feb 2025

Gradient Multi-Normalization for Stateless and Scalable LLM Training
M. Scetbon, Chao Ma, Wenbo Gong, Edward Meeds
10 Feb 2025

Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
Akiyoshi Tomihari, Issei Sato
31 Jan 2025

Class Imbalance in Anomaly Detection: Learning from an Exactly Solvable Model
F.S. Pezzicoli, V. Ros, F.P. Landes, M. Baity-Jesi
20 Jan 2025

Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
30 Dec 2024

Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers
Gavia Gray, Aman Tiwari, Shane Bergsma, Joel Hestness
01 Nov 2024

Understanding Adam Requires Better Rotation Dependent Assumptions
Lucas Maes, Tianyue H. Zhang, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Damien Scieur, Simon Lacoste-Julien, Charles Guille-Escuret
25 Oct 2024

Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise
Tao Sun, Xinwang Liu, Kun Yuan
21 Oct 2024

Deconstructing What Makes a Good Optimizer for Language Models
Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade
10 Jul 2024

Adam-mini: Use Fewer Learning Rates To Gain More
Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun
24 Jun 2024

Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights
Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
31 May 2024

Understanding and Minimising Outlier Features in Neural Network Training
Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
29 May 2024

Does SGD really happen in tiny subspaces?
Minhak Song, Kwangjun Ahn, Chulhee Yun
25 May 2024

Understanding Adam Optimizer via Online Learning of Updates: Adam is FTRL in Disguise
Kwangjun Ahn, Zhiyu Zhang, Yunbum Kook, Yan Dai
02 Feb 2024

Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner, Jacques Chen, J. Lavington, Mark W. Schmidt
27 Apr 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022

An Investigation of Why Overparameterization Exacerbates Spurious Correlations
Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, Percy Liang
09 May 2020