
μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

Main: 9 pages · 12 figures · 13 tables · Bibliography: 4 pages · Appendix: 17 pages
Abstract

Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (μP) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for μ-parameterized LOs (μLOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (SP) using the same compute budget. We also empirically observe that μLOs exhibit unexpectedly improved meta-generalization to deeper networks (5× meta-training) and surprising generalization to much longer training horizons (25× meta-training) when compared to SP LOs.
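To make the abstract's central idea concrete, here is a minimal sketch of the Maximal Update Parametrization (μP) scaling rules the paper builds on (Yang & Hu, Tensor Programs V). The per-layer rules below follow the published μP table for Adam-like updates; the paper's exact derivation for learned-optimizer updates may differ, and the function and layer-type names are illustrative, not from the paper.

```python
# Sketch of muP-style per-layer scaling, assuming Adam-like update rules.
# Under muP, initialization variance and an update multiplier are set per
# layer as functions of fan-in, so training behavior stays stable as width
# grows -- the property that lets (learned) optimizer behavior transfer
# from narrow meta-training networks to much wider unseen ones.
import numpy as np

def mup_init_and_multiplier(fan_in: int, layer_type: str):
    """Return (init_std, update_multiplier) for one weight matrix.

    layer_type: 'input' (fan-in is the fixed data dimension), 'hidden',
    or 'output'. The update multiplier rescales whatever step the
    (learned) optimizer proposes for that layer's weights.
    """
    if layer_type == "input":
        return 1.0 / np.sqrt(fan_in), 1.0            # standard init, unscaled step
    if layer_type == "hidden":
        return 1.0 / np.sqrt(fan_in), 1.0 / fan_in   # step shrinks with width
    if layer_type == "output":
        return 1.0 / fan_in, 1.0 / fan_in            # smaller init and step
    raise ValueError(f"unknown layer type: {layer_type}")

# Widening a hidden layer from 256 to 4096 shrinks its update multiplier
# by 16x, illustrating how muP keeps updates well-scaled across widths.
for width in (256, 1024, 4096):
    std, mult = mup_init_and_multiplier(width, "hidden")
    print(f"width={width:5d}  init_std={std:.4f}  update_mult={mult:.6f}")
```

In this framing, a μLO is a learned optimizer whose proposed updates are composed with per-layer multipliers like these during both meta-training and deployment, which is what the abstract credits for the improved width generalization.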
