ResearchTrend.AI
Scaling Optimal LR Across Token Horizons
International Conference on Learning Representations (ICLR), 2024
arXiv:2409.19913 · 30 September 2024
Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, Xia Song

Papers citing "Scaling Optimal LR Across Token Horizons"

49 papers
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu
17 Oct 2025

Latent Representation Learning in Heavy-Ion Collisions with MaskPoint Transformer
Jing-Zong Zhang, Shuang Guo, Li-Lin Zhu, Lingxiao Wang, Guo-Liang Ma
08 Oct 2025

Optimal Scaling Needs Optimal Norm
Oleg Filatov, Jiangtao Wang, J. Ebert, Stefan Kesselheim
04 Oct 2025

Efficient Hyperparameter Tuning via Trajectory Invariance Principle
Bingrui Li, Jiaxin Wen, Zhanpeng Zhou, Jun-Jie Zhu, Jianfei Chen
29 Sep 2025

Scaling with Collapse: Efficient and Predictable Training of LLM Families
Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal, Joel Hestness
29 Sep 2025

The Importance of Being Lazy: Scaling Limits of Continual Learning
Jacopo Graldi, Alessandro Breccia, Giulia Lanzillotta, Thomas Hofmann, Lorenzo Noci
20 Jun 2025

MiniCPM4: Ultra-Efficient LLMs on End Devices
MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, ..., Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun
09 Jun 2025

Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
19 May 2025
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, ..., Ishmam Zabir, Yunan Zhang, Li Zhang, Yanzhe Zhang, Xiren Zhou
03 Mar 2025

Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
International Conference on Learning Representations (ICLR), 2025
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
21 Feb 2025

u-μP: The Unit-Scaled Maximal Update Parametrization
Charlie Blake, C. Eichenberg, Josef Dean, Lukas Balles, Luke Y. Prince, Bjorn Deiseroth, Andres Felipe Cruz Salinas, Carlo Luschi, Samuel Weinbach, Douglas Orr
24 Jul 2024

Scaling Exponents Across Parameterizations and Optimizers
Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, ..., Izzeddin Gur, Jascha Narain Sohl-Dickstein, L. Kaelbling, Jaehoon Lee, Jeffrey Pennington
08 Jul 2024

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf
25 Jun 2024

How to set AdamW's weight decay as you scale model and dataset size
Xi Wang, Laurence Aitchison
22 May 2024

Wukong: Towards a Scaling Law for Large-Scale Recommendation
Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, ..., Guna Lakshminarayanan, Ellie Wen, Jongsoo Park, Maxim Naumov, Wenlin Chen
04 Mar 2024

When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat
27 Feb 2024
Scaling Laws for Fine-Grained Mixture of Experts
Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michal Krutul, ..., Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur
12 Feb 2024

A Tale of Tails: Model Collapse as a Change of Scaling Laws
International Conference on Machine Learning (ICML), 2024
Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Julia Kempe
10 Feb 2024

Selecting Large Language Model to Fine-tune via Rectified Scaling Law
Haowei Lin, Baizhou Huang, Haotian Ye, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, Yitao Liang
04 Feb 2024

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI: Xiao Bi, Deli Chen, Guanting Chen, ..., Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou
05 Jan 2024

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
International Conference on Machine Learning (ICML), 2023
Nikhil Sardana, Jacob P. Portes, Sasha Doubov, Jonathan Frankle
31 Dec 2023

Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu, Tri Dao
01 Dec 2023

Small-scale proxies for large-scale Transformer training instabilities
International Conference on Learning Representations (ICLR), 2023
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, A. Alemi, ..., Jascha Narain Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
25 Sep 2023

Scaling Laws for Sparsely-Connected Foundation Models
International Conference on Learning Representations (ICLR), 2023
Elias Frantar, C. Riquelme, N. Houlsby, Dan Alistarh, Utku Evci
15 Sep 2023
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra-Aimée Cojocaru, Alessandro Cappelli, Hamza Alobeidli, B. Pannier, Ebtesam Almazrouei, Julien Launay
01 Jun 2023

Scaling Data-Constrained Language Models
Neural Information Processing Systems (NeurIPS), 2023
Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, S. Pyysalo, Thomas Wolf, Colin Raffel
25 May 2023

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
22 May 2023

Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Neural Information Processing Systems (NeurIPS), 2023
Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer
22 May 2023

The Quantization Model of Neural Scaling
Neural Information Processing Systems (NeurIPS), 2023
Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
23 Mar 2023

Language Is Not All You Need: Aligning Perception with Language Models
Neural Information Processing Systems (NeurIPS), 2023
Shaohan Huang, Li Dong, Wenhui Wang, Y. Hao, Saksham Singhal, ..., Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei
27 Feb 2023

LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, ..., Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
27 Feb 2023

Scaling Laws for Multilingual Neural Machine Translation
International Conference on Machine Learning (ICML), 2023
Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag, Orhan Firat
19 Feb 2023
Scaling Vision Transformers to 22 Billion Parameters
International Conference on Machine Learning (ICML), 2023
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, ..., Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, N. Houlsby
10 Feb 2023

Scaling Laws for Generative Mixed-Modal Language Models
International Conference on Machine Learning (ICML), 2023
Armen Aghajanyan, L. Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer
10 Jan 2023

Reproducible scaling laws for contrastive language-image learning
Computer Vision and Pattern Recognition (CVPR), 2022
Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, J. Jitsev
14 Dec 2022

Broken Neural Scaling Laws
International Conference on Learning Representations (ICLR), 2022
Ethan Caballero, Kshitij Gupta, Irina Rish, David M. Krueger
26 Oct 2022

Scaling Laws for Reward Model Overoptimization
International Conference on Machine Learning (ICML), 2022
Leo Gao, John Schulman, Jacob Hilton
19 Oct 2022

Scaling Laws for a Multi-Agent Reinforcement Learning Model
International Conference on Learning Representations (ICLR), 2022
Oren Neumann, C. Gros
29 Sep 2022

Emergent Abilities of Large Language Models
Jason W. Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, ..., Tatsunori Hashimoto, Oriol Vinyals, Abigail Z. Jacobs, J. Dean, W. Fedus
15 Jun 2022

Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, A. Mensch, Elena Buchatskaya, Trevor Cai, ..., Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre
29 Mar 2022
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang, J. E. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, J. Pachocki, Weizhu Chen, Jianfeng Gao
07 Mar 2022

Understanding Decoupled and Early Weight Decay
AAAI Conference on Artificial Intelligence (AAAI), 2020
Johan Bjorck, Kilian Q. Weinberger, Daniel Schwalbe-Koda
27 Dec 2020

Array Programming with NumPy
Charles R. Harris, K. Millman, S. Walt, R. Gommers, Pauli Virtanen, ..., Tyler Reddy, Warren Weckesser, Hameer Abbasi, C. Gohlke, T. Oliphant
18 Jun 2020

Language Models are Few-Shot Learners
Neural Information Processing Systems (NeurIPS), 2020
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
28 May 2020

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
23 Jan 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019

Attention Is All You Need
Neural Information Processing Systems (NeurIPS), 2017
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin
12 Jun 2017

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal, Piotr Dollár, Ross B. Girshick, P. Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He
08 Jun 2017

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
International Conference on Learning Representations (ICLR), 2017
Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, J. Dean
23 Jan 2017