
Decoupled Relative Learning Rate Schedules

Jan Ludziejewski
Jan Małaśnicki
Maciej Pióro
Michał Krutul
Kamil Ciebiera
Maciej Stefaniak
Jakub Krajewski
Piotr Sankowski
Marek Cygan
Kamil Adamczewski
Sebastian Jaszczur
Main: 10 pages, 10 figures, 5 tables; Bibliography: 2 pages; Appendix: 3 pages
Abstract

In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across the weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our Relative Learning Rate Schedules (RLRS) method accelerates training by up to 23%, particularly in complex models such as Mixture of Experts (MoE). The hyperparameters of RLRS can be efficiently tuned on smaller models and then effectively reused on models up to 27× larger. This simple and effective method yields a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
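To make the core idea concrete, the sketch below shows one common way to assign component-wise learning rates in PyTorch, via optimizer parameter groups scaled by relative multipliers. This is an illustrative assumption about the mechanism, not the paper's implementation: the multiplier values, the name-based component matching, and the toy model are all hypothetical, and the paper's tuned RLRS hyperparameters are not reproduced here.

```python
import torch
from torch import nn

# Base learning rate and hypothetical per-component multipliers.
# These values are illustrative assumptions, NOT the tuned RLRS settings.
BASE_LR = 1e-3
RELATIVE_LR = {
    "embedding": 0.5,
    "attention": 1.0,
    "feedforward": 2.0,
}

def component_of(param_name: str) -> str:
    """Bucket a parameter by the (assumed) naming of its module."""
    if "embed" in param_name:
        return "embedding"
    if "attn" in param_name:
        return "attention"
    return "feedforward"

# A minimal stand-in for Transformer components, just to have named parameters.
model = nn.ModuleDict({
    "embed": nn.Embedding(1000, 64),
    "attn": nn.MultiheadAttention(64, 4, batch_first=True),
    "ffn": nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)),
})

# Collect parameters into one group per component.
groups = {name: [] for name in RELATIVE_LR}
for pname, param in model.named_parameters():
    groups[component_of(pname)].append(param)

# Each parameter group gets its own effective learning rate,
# i.e. the base rate scaled by the component's relative multiplier.
optimizer = torch.optim.AdamW(
    [{"params": params, "lr": BASE_LR * RELATIVE_LR[comp]}
     for comp, params in groups.items() if params]
)

for group in optimizer.param_groups:
    print(len(group["params"]), "tensors at lr", group["lr"])
```

Because each component lives in its own parameter group, any standard learning rate scheduler applied on top scales all groups together while preserving their relative ratios, which matches the abstract's framing of relative (rather than absolute) per-component rates.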
