AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods
Tim Tsz-Kit Lau, Han Liu, Mladen Kolar
17 February 2024
arXiv: 2402.11215
ODL

Cited By

Papers citing "AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods" (8 of 8 papers shown)

Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
30 Dec 2024

Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner, Jacques Chen, J. Lavington, Mark W. Schmidt
27 Apr 2023

Adaptive Sampling Quasi-Newton Methods for Zeroth-Order Stochastic Optimization
Raghu Bollapragada, Stefan M. Wild
24 Sep 2021

A High Probability Analysis of Adaptive SGD with Momentum
Xiaoyun Li, Francesco Orabona
28 Jul 2020

A Simple Convergence Proof of Adam and Adagrad
Alexandre Défossez, Léon Bottou, Francis R. Bach, Nicolas Usunier
05 Mar 2020

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
23 Jan 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
MoE
17 Sep 2019

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang
ODL
15 Sep 2016