How Does Critical Batch Size Scale in Pre-training?
arXiv:2410.21676 · 29 October 2024
Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Phillips Foster, Sham Kakade
Papers citing "How Does Critical Batch Size Scale in Pre-training?" (6 of 6 shown)
1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzciński, Benjamin Eysenbach
Tags: SSL, OffRL
19 Mar 2025
Training and Inference Efficiency of Encoder-Decoder Speech Models
Piotr Żelasko, Kunal Dhawan, Daniel Galvez, Krishna C. Puvvada, Ankita Pasad, Nithin Rao Koluguri, Ke Hu, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg
07 Mar 2025
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
21 Feb 2025
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
30 Dec 2024
Deconstructing What Makes a Good Optimizer for Language Models
Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade
10 Jul 2024
How to set AdamW's weight decay as you scale model and dataset size
Xi Wang, Laurence Aitchison
22 May 2024