ResearchTrend.AI
Scaling Optimal LR Across Token Horizons
International Conference on Learning Representations (ICLR), 2024
arXiv:2409.19913 · 30 September 2024
Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, Xia Song

Papers citing "Scaling Optimal LR Across Token Horizons"

49 papers
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu
17 Oct 2025

Latent Representation Learning in Heavy-Ion Collisions with MaskPoint Transformer
Jing-Zong Zhang, Shuang Guo, Li-Lin Zhu, Lingxiao Wang, Guo-Liang Ma
08 Oct 2025

Optimal Scaling Needs Optimal Norm
Oleg Filatov, Jiangtao Wang, J. Ebert, Stefan Kesselheim
04 Oct 2025

Efficient Hyperparameter Tuning via Trajectory Invariance Principle
Bingrui Li, Jiaxin Wen, Zhanpeng Zhou, Jun-Jie Zhu, Jianfei Chen
29 Sep 2025

Scaling with Collapse: Efficient and Predictable Training of LLM Families
Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal, Joel Hestness
29 Sep 2025

The Importance of Being Lazy: Scaling Limits of Continual Learning
Jacopo Graldi, Alessandro Breccia, Giulia Lanzillotta, Thomas Hofmann, Lorenzo Noci
20 Jun 2025

MiniCPM4: Ultra-Efficient LLMs on End Devices
MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, ..., Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun
09 Jun 2025

Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
19 May 2025
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, ..., Ishmam Zabir, Yunan Zhang, Li Zhang, Yanzhe Zhang, Xiren Zhou
03 Mar 2025

Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
International Conference on Learning Representations (ICLR), 2025
Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness
21 Feb 2025

u-μP: The Unit-Scaled Maximal Update Parametrization
Charlie Blake, C. Eichenberg, Josef Dean, Lukas Balles, Luke Y. Prince, Bjorn Deiseroth, Andres Felipe Cruz Salinas, Carlo Luschi, Samuel Weinbach, Douglas Orr
24 Jul 2024

Scaling Exponents Across Parameterizations and Optimizers
Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, ..., Izzeddin Gur, Jascha Narain Sohl-Dickstein, L. Kaelbling, Jaehoon Lee, Jeffrey Pennington
08 Jul 2024

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf
25 Jun 2024

How to set AdamW's weight decay as you scale model and dataset size
Xi Wang, Laurence Aitchison
22 May 2024

Wukong: Towards a Scaling Law for Large-Scale Recommendation
Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, ..., Guna Lakshminarayanan, Ellie Wen, Jongsoo Park, Maxim Naumov, Wenlin Chen
04 Mar 2024

When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat
27 Feb 2024
Scaling Laws for Fine-Grained Mixture of Experts
Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michal Krutul, ..., Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur
12 Feb 2024

A Tale of Tails: Model Collapse as a Change of Scaling Laws
International Conference on Machine Learning (ICML), 2024
Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Julia Kempe
10 Feb 2024

Selecting Large Language Model to Fine-tune via Rectified Scaling Law
Haowei Lin, Baizhou Huang, Haotian Ye, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, Yitao Liang
04 Feb 2024

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI: Xiao Bi, Deli Chen, Guanting Chen, ..., Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou
05 Jan 2024

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
International Conference on Machine Learning (ICML), 2023
Nikhil Sardana, Jacob P. Portes, Sasha Doubov, Jonathan Frankle
31 Dec 2023

Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu, Tri Dao
01 Dec 2023

Small-scale proxies for large-scale Transformer training instabilities
International Conference on Learning Representations (ICLR), 2023
Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, A. Alemi, ..., Jascha Narain Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
25 Sep 2023

Scaling Laws for Sparsely-Connected Foundation Models
International Conference on Learning Representations (ICLR), 2023
Elias Frantar, C. Riquelme, N. Houlsby, Dan Alistarh, Utku Evci
15 Sep 2023
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra-Aimée Cojocaru, Alessandro Cappelli, Hamza Alobeidli, B. Pannier, Ebtesam Almazrouei, Julien Launay
01 Jun 2023

Scaling Data-Constrained Language Models
Neural Information Processing Systems (NeurIPS), 2023
Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, S. Pyysalo, Thomas Wolf, Colin Raffel
25 May 2023

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
22 May 2023

Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Neural Information Processing Systems (NeurIPS), 2023
Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer
22 May 2023

The Quantization Model of Neural Scaling
Neural Information Processing Systems (NeurIPS), 2023
Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
23 Mar 2023

Language Is Not All You Need: Aligning Perception with Language Models
Neural Information Processing Systems (NeurIPS), 2023
Shaohan Huang, Li Dong, Wenhui Wang, Y. Hao, Saksham Singhal, ..., Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei
27 Feb 2023

LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, ..., Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
27 Feb 2023

Scaling Laws for Multilingual Neural Machine Translation
International Conference on Machine Learning (ICML), 2023
Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag, Orhan Firat
19 Feb 2023
Scaling Vision Transformers to 22 Billion Parameters
International Conference on Machine Learning (ICML), 2023
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, ..., Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, N. Houlsby
10 Feb 2023

Scaling Laws for Generative Mixed-Modal Language Models
International Conference on Machine Learning (ICML), 2023
Armen Aghajanyan, L. Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer
10 Jan 2023

Reproducible scaling laws for contrastive language-image learning
Computer Vision and Pattern Recognition (CVPR), 2022
Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, J. Jitsev
14 Dec 2022

Broken Neural Scaling Laws
International Conference on Learning Representations (ICLR), 2022
Ethan Caballero, Kshitij Gupta, Irina Rish, David M. Krueger
26 Oct 2022

Scaling Laws for Reward Model Overoptimization
International Conference on Machine Learning (ICML), 2022
Leo Gao, John Schulman, Jacob Hilton
19 Oct 2022

Scaling Laws for a Multi-Agent Reinforcement Learning Model
International Conference on Learning Representations (ICLR), 2022
Oren Neumann, C. Gros
29 Sep 2022

Emergent Abilities of Large Language Models
Jason W. Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, ..., Tatsunori Hashimoto, Oriol Vinyals, Abigail Z. Jacobs, J. Dean, W. Fedus
15 Jun 2022

Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, A. Mensch, Elena Buchatskaya, Trevor Cai, ..., Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre
29 Mar 2022
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Greg Yang, J. E. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, J. Pachocki, Weizhu Chen, Jianfeng Gao
07 Mar 2022

Understanding Decoupled and Early Weight Decay
AAAI Conference on Artificial Intelligence (AAAI), 2020
Johan Bjorck, Kilian Q. Weinberger, Daniel Schwalbe-Koda
27 Dec 2020

Array Programming with NumPy
Charles R. Harris, K. Millman, S. Walt, R. Gommers, Pauli Virtanen, ..., Tyler Reddy, Warren Weckesser, Hameer Abbasi, C. Gohlke, T. Oliphant
18 Jun 2020

Language Models are Few-Shot Learners
Neural Information Processing Systems (NeurIPS), 2020
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
28 May 2020

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
23 Jan 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019

Attention Is All You Need
Neural Information Processing Systems (NeurIPS), 2017
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin
12 Jun 2017

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal, Piotr Dollár, Ross B. Girshick, P. Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He
08 Jun 2017

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
International Conference on Learning Representations (ICLR), 2017
Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, J. Dean
23 Jan 2017