AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods

17 February 2024

Papers citing "AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods"

Title | Authors | Tag | Counts | Date
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism | Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar | - | 62 0 0 | 30 Dec 2024
Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be | Frederik Kunstner, Jacques Chen, J. Lavington, Mark W. Schmidt | - | 38 66 0 | 27 Apr 2023
Adaptive Sampling Quasi-Newton Methods for Zeroth-Order Stochastic Optimization | Raghu Bollapragada, Stefan M. Wild | - | 19 11 0 | 24 Sep 2021
A High Probability Analysis of Adaptive SGD with Momentum | Xiaoyun Li, Francesco Orabona | - | 79 64 0 | 28 Jul 2020
A Simple Convergence Proof of Adam and Adagrad | Alexandre Défossez, Léon Bottou, Francis R. Bach, Nicolas Usunier | - | 56 143 0 | 05 Mar 2020
Scaling Laws for Neural Language Models | Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei | - | 220 4,424 0 | 23 Jan 2020
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro | MoE | 243 1,791 0 | 17 Sep 2019
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima | N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang | ODL | 273 2,878 0 | 15 Sep 2016