Depth Dependence of μP Learning Rates in ReLU MLPs

Abstract

In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($\mu$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n, L$. As in prior work on $\mu$P by Yang et al., we find that this maximal update learning rate is independent of $n$ for all but the first and last layer weights. However, we find that it has a non-trivial dependence on $L$, scaling like $L^{-3/2}$.
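The quantity studied here can be probed numerically. The sketch below (not the authors' code) builds a width-$n$, depth-$L$ ReLU MLP, takes one gradient descent step with learning rate $\eta = L^{-3/2}$, and measures the mean squared change in pre-activations. The initialization (i.i.d. Gaussian entries of variance $1/n$), the scalar readout loss, and the uniform learning rate across all layers are illustrative assumptions; the paper's exact mean-field scheme and per-layer treatment of the first and last weights may differ.

```python
# Minimal sketch: does the mean squared change in pre-activations after one
# gradient step stay bounded in depth L when the learning rate scales like
# L^{-3/2}? Init scheme and loss are assumptions, not the paper's setup.
import torch
import torch.nn as nn


def make_mlp(n, L):
    """Fully connected ReLU MLP of width n with L hidden layers."""
    layers = [nn.Linear(n, n, bias=False) for _ in range(L)]
    model = nn.Sequential(*[m for layer in layers for m in (layer, nn.ReLU())])
    # Mean-field-style init (assumed): i.i.d. N(0, 1/n) entries per weight.
    for p in model.parameters():
        nn.init.normal_(p, std=n ** -0.5)
    return model


def preact_change(n=256, L=16, lr_exponent=-1.5, seed=0):
    torch.manual_seed(seed)
    model = make_mlp(n, L)
    x = torch.randn(1, n)

    def preacts(m):
        """Collect the pre-activation vector entering each ReLU."""
        outs, h = [], x
        for layer in m:
            h = layer(h)
            if isinstance(layer, nn.Linear):
                outs.append(h.detach().clone())
        return outs

    before = preacts(model)
    # One step of plain gradient descent on a scalar readout of the output,
    # with the depth-dependent learning rate eta = L ** lr_exponent.
    lr = L ** lr_exponent
    model(x).sum().backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
    after = preacts(model)
    # Mean squared change in pre-activations, averaged over layers.
    return torch.stack(
        [((a - b) ** 2).mean() for a, b in zip(after, before)]
    ).mean()


if __name__ == "__main__":
    for L in (4, 8, 16, 32):
        print(L, float(preact_change(L=L)))
```

If the $L^{-3/2}$ scaling of the abstract holds in this toy setting, the printed values should remain roughly bounded as $L$ grows with `lr_exponent=-1.5`, whereas a milder scaling such as `lr_exponent=-1.0` should let them grow with depth.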
