Scaling Laws and Compute-Optimal Training Beyond Fixed Training
Durations

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

28 May 2024

Alexander Hägele

Loubna Ben Allal

Leandro von Werra

Papers citing "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"

11 / 11 papers shown

Title
Trillion 7B Technical Report Sungjun Han Juyoung Suk Suyeong An Hyungguk Kim Kyuseok Kim Wonsuk Yang Seungtaek Choi Jamin Shin 25 0 0 21 Apr 2025
Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit Oleg Filatov Jan Ebert Jiangtao Wang Stefan Kesselheim 32 3 0 10 Jan 2025
Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits Daniel Morales-Brotons Thijs Vogels Hadrien Hendrikx 104 12 0 27 Nov 2024
Resolving Discrepancies in Compute-Optimal Scaling of Language Models Tomer Porian Mitchell Wortsman J. Jitsev Ludwig Schmidt Y. Carmon 44 19 0 27 Jun 2024
Language models scale reliably with over-training and on downstream tasks S. Gadre Georgios Smyrnis Vaishaal Shankar Suchin Gururangan Mitchell Wortsman ... Y. Carmon Achal Dave Reinhard Heckel Niklas Muennighoff Ludwig Schmidt ALM ELM LRM 91 40 0 13 Mar 2024
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models Soham De Samuel L. Smith Anushan Fernando Aleksandar Botev George-Christian Muraru ... David Budden Yee Whye Teh Razvan Pascanu Nando de Freitas Çağlar Gülçehre Mamba 51 116 0 29 Feb 2024
CroissantLLM: A Truly Bilingual French-English Language Model Manuel Faysse Patrick Fernandes Nuno M. Guerreiro António Loison Duarte M. Alves ... François Yvon André F.T. Martins Gautier Viaud C´eline Hudelot Pierre Colombo 34 33 0 01 Feb 2024
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism DeepSeek-AI Xiao Bi : Xiao Bi Deli Chen Guanting Chen ... Yao Zhao Shangyan Zhou Shunfeng Zhou Qihao Zhu Yuheng Zou LRM ALM 133 298 0 05 Jan 2024
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers Yi Tay Mostafa Dehghani J. Rao W. Fedus Samira Abnar Hyung Won Chung Sharan Narang Dani Yogatama Ashish Vaswani Donald Metzler 181 89 0 22 Sep 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 236 1,508 0 31 Dec 2020
Scaling Laws for Neural Language Models Jared Kaplan Sam McCandlish T. Henighan Tom B. Brown B. Chess R. Child Scott Gray Alec Radford Jeff Wu Dario Amodei 220 3,054 0 23 Jan 2020