
μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

Main: 9 pages · Appendix: 17 pages · Bibliography: 4 pages · 12 figures · 13 tables
Abstract

Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially lowering training costs. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (μP) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for μ-parameterized LOs (μLOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks compared to LOs trained under standard parametrization (SP), as in existing work. We also empirically observe that μLOs trained with our recipe exhibit unexpectedly improved meta-generalization to deeper networks (5× meta-training depth) and surprising generalization to much longer training horizons (25× the meta-training horizon) compared to SP LOs.
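For readers unfamiliar with μP, the minimal sketch below illustrates the generic width-dependent scaling rules commonly stated for Adam-like updates (per-layer initialization variance and learning-rate scaling as a function of fan-in). It is an assumption-laden illustration of μP itself, not the paper's derivation for the two learned-optimizer architectures, which is specific to those architectures and is not reproduced here; the function name and rule table are hypothetical conveniences.

```python
import numpy as np

def mup_dense_layer(fan_in, fan_out, base_lr, kind="hidden", rng=None):
    """Return (weights, per-layer LR) under generic muP-style rules.

    kind: "input", "hidden", or "output".
    Illustrative sketch only -- assumes the commonly cited muP table for
    Adam-like updates; the paper's muLO derivation is not reproduced here.
    """
    rng = np.random.default_rng() if rng is None else rng
    if kind == "input":
        # Input layer: fan_in is the (fixed) data dimension, so init and LR
        # do not shrink with model width.
        std, lr = 1.0 / np.sqrt(fan_in), base_lr
    elif kind == "hidden":
        # Hidden layers: init variance ~ 1/fan_in, update scale ~ 1/fan_in.
        std, lr = 1.0 / np.sqrt(fan_in), base_lr / fan_in
    else:
        # Output layer: init variance ~ 1/fan_in**2, update scale ~ 1/fan_in.
        std, lr = 1.0 / fan_in, base_lr / fan_in
    weights = rng.normal(0.0, std, size=(fan_in, fan_out))
    return weights, lr

# Example: widening the hidden layer leaves the input layer untouched but
# shrinks the hidden/output update scales, which is what keeps feature
# updates stable across widths under muP.
w_hid, lr_hid = mup_dense_layer(fan_in=1024, fan_out=1024, base_lr=1e-3, kind="hidden")
w_out, lr_out = mup_dense_layer(fan_in=1024, fan_out=10, base_lr=1e-3, kind="output")
print(lr_hid, lr_out)
```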
