How Does Critical Batch Size Scale in Pre-training?
arXiv:2410.21676 · 29 October 2024
Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Phillips Foster, Sham Kakade
Papers citing "How Does Critical Batch Size Scale in Pre-training?" (6 of 6 shown)
1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzciński, Benjamin Eysenbach
Tags: SSL, OffRL
19 Mar 2025
Training and Inference Efficiency of Encoder-Decoder Speech Models
Piotr Żelasko, Kunal Dhawan, Daniel Galvez, Krishna C. Puvvada, Ankita Pasad, Nithin Rao Koluguri, Ke Hu, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg
07 Mar 2025
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
21 Feb 2025
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
30 Dec 2024
Deconstructing What Makes a Good Optimizer for Language Models
Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade
10 Jul 2024
How to set AdamW's weight decay as you scale model and dataset size
Xi Wang, Laurence Aitchison
22 May 2024