
Arithmetic-Mean μμP for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets

Main: 9 pages
10 figures
Bibliography: 2 pages
Appendix: 15 pages
Abstract

Choosing an appropriate learning rate remains a key challenge when scaling the depth of modern deep networks. The classical maximal update parameterization (μP) enforces a fixed per-layer update magnitude, which is well suited to homogeneous multilayer perceptrons (MLPs) but becomes ill-posed in heterogeneous architectures, where residual accumulation and convolutions introduce imbalance across layers. We introduce Arithmetic-Mean μP (AM-μP), which constrains not each individual layer but the network-wide average one-step pre-activation second moment to a constant scale. Combined with a residual-aware He fan-in initialization that scales residual-branch weights by the number of blocks ($\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$), AM-μP yields width-robust depth laws that transfer consistently across depths. We prove that, for one- and two-dimensional convolutional networks, the maximal-update learning rate satisfies $\eta^\star(L)\propto L^{-3/2}$; with zero padding, boundary effects are constant-level when $N\gg k$. For standard residual networks with general conv+MLP blocks, we establish $\eta^\star(L)=\Theta(L^{-3/2})$, where $L$ is the minimal depth. Empirical results across a range of depths confirm the $-3/2$ scaling law and enable zero-shot learning-rate transfer, providing a unified and practical learning-rate principle for convolutional and deep residual networks without additional tuning overhead.
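To make the two ingredients of the abstract concrete, here is a minimal PyTorch sketch (not the authors' released code) of the residual-aware He fan-in initialization, $\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$, and the $\eta^\star(L)\propto L^{-3/2}$ rule for zero-shot learning-rate transfer. The constant `c`, the base depth/learning rate, and the layer shapes are illustrative assumptions, not values from the paper.

```python
# Hedged sketch under assumed constants: residual-aware He fan-in init
# with Var[W] = c / (K * fan_in), plus the AM-muP depth law
# eta*(L) proportional to L^(-3/2) for zero-shot LR transfer.
import math
import torch
import torch.nn as nn


def residual_he_init_(conv: nn.Conv2d, num_blocks: int, c: float = 2.0) -> None:
    """Initialize a residual-branch conv so that Var[W] = c / (K * fan_in)."""
    fan_in = conv.in_channels * conv.kernel_size[0] * conv.kernel_size[1]
    std = math.sqrt(c / (num_blocks * fan_in))
    nn.init.normal_(conv.weight, mean=0.0, std=std)
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)


def transfer_lr(base_lr: float, base_depth: int, target_depth: int) -> float:
    """Rescale a tuned LR to a new depth under eta*(L) ∝ L^(-3/2)."""
    return base_lr * (target_depth / base_depth) ** (-1.5)


# Usage: initialize a residual-branch conv in a K = 20 block network, then
# transfer an LR tuned at depth 20 to a depth-80 network without re-tuning.
K = 20
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
residual_he_init_(conv, num_blocks=K)
lr_80 = transfer_lr(base_lr=0.1, base_depth=20, target_depth=80)
print(f"transferred LR for depth 80: {lr_80:.4f}")
```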
