u-μP: The Unit-Scaled Maximal Update Parametrization

Abstract

The Maximal Update Parametrization (μP) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-μP, which improves upon μP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: μP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-μP models reaching a loss that is equal to or lower than comparable μP models and working out-of-the-box in FP8.
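To make the Unit Scaling idea concrete, below is a minimal sketch (not the paper's implementation; the class name `UnitScaledLinear` is hypothetical) of a unit-scaled linear layer in PyTorch: weights are drawn with unit variance, and the conventional 1/sqrt(fan_in) initializer factor is instead applied as a fixed multiplier in the forward pass, so weights and activations both start training at scale one. The full method also controls gradient scales with separate backward multipliers, which this sketch omits; keeping tensors near unit scale is what makes the limited dynamic range of FP8 usable out of the box.

```python
import math
import torch
import torch.nn as nn

class UnitScaledLinear(nn.Module):
    """Unit-scaled linear layer (illustrative sketch only).

    Weights are initialized with unit variance; the usual 1/sqrt(fan_in)
    initializer factor is moved into the forward pass as a static
    multiplier, so weights and outputs are both at scale ~1 at init.
    """

    def __init__(self, fan_in: int, fan_out: int) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))  # unit variance
        self.scale = 1.0 / math.sqrt(fan_in)  # fixed scale, not learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (x @ self.weight.t())

# At initialization, unit-scale inputs yield unit-scale outputs:
x = torch.randn(256, 512)
print(UnitScaledLinear(512, 512)(x).std())  # ≈ 1.0
```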
