How to set AdamW's weight decay as you scale model and dataset size

22 May 2024

Papers citing "How to set AdamW's weight decay as you scale model and dataset size"

4 / 4 papers shown

Title
Don't be lazy: CompleteP enables compute-efficient deep transformers Nolan Dey Bin Claire Zhang Lorenzo Noci Mufan Bill Li Blake Bordelon Shane Bergsma C. Pehlevan Boris Hanin Joel Hestness 37 0 0 02 May 2025
FOCUS: First Order Concentrated Updating Scheme Yizhou Liu Ziming Liu Jeff Gore ODL 104 0 0 21 Jan 2025
Scaling Optimal LR Across Token Horizons Johan Bjorck Alon Benhaim Vishrav Chaudhary Furu Wei Xia Song 41 4 0 30 Sep 2024
$u-$\mu$P: The Unit-Scaled Maximal Update Parametrization$ u- $\mu$ P: The Unit-Scaled Maximal Update Parametrization Charlie Blake C. Eichenberg Josef Dean Lukas Balles Luke Y. Prince Bjorn Deiseroth Andres Felipe Cruz Salinas Carlo Luschi Samuel Weinbach Douglas Orr 46 9 0 24 Jul 2024