Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
arXiv:2502.15938 · 21 February 2025
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
Papers citing "Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs" (5 papers)
Don't be lazy: CompleteP enables compute-efficient deep transformers
Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Bill Li, Blake Bordelon, Shane Bergsma, C. Pehlevan, Boris Hanin, Joel Hestness
02 May 2025
The Rise of Small Language Models in Healthcare: A Comprehensive Survey
Muskan Garg, Shaina Raza, Shebuti Rayana, Xingyi Liu, Sunghwan Sohn
23 Apr 2025 · LM&MA, AILaw
Mixture of Group Experts for Learning Invariant Representations
Lei Kang, Jia Li, Mi Tian, Hua Huang
12 Apr 2025 · MoE
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, Wenguang Chen
17 Mar 2025 · CLL
How to set AdamW's weight decay as you scale model and dataset size
Xi Wang, Laurence Aitchison
22 May 2024