
μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

Main: 9 pages · 12 figures · 13 tables · Bibliography: 4 pages · Appendix: 17 pages
Abstract

Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially reducing training costs. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks much larger than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (μP) for two popular learned optimizer architectures and propose a simple meta-training recipe for μ-parameterized LOs (μLOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks compared to LOs trained under the standard parametrization (i.e., as they are trained in existing work). When applying our μLOs, each trained for less than 250 GPU-hours, to large-width models, we often match or exceed the performance of pre-trained VeLO, the most performant publicly available learned optimizer, which was meta-trained with 4000 TPU-months of compute. We also observe that μLOs exhibit substantially improved meta-generalization to deeper networks (5× the meta-training depth) and remarkable generalization to much longer training horizons (25× the meta-training horizon).
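As a rough illustration of the idea, the sketch below shows how a μP-style per-layer multiplier might be applied to a learned optimizer's raw update so that its effect stays width-invariant as the network grows; the names (mup_update_scale, apply_lo_update) and the layer taxonomy are illustrative assumptions, not the authors' implementation.

import numpy as np

# Hedged sketch of muP-style update scaling (assumed names, not the
# paper's code). With Adam-like normalized updates, muP prescribes
# scaling updates to matrix-like weights by 1/fan_in so that per-feature
# changes stay O(1) as width increases.

def mup_update_scale(shape, kind):
    if kind in ("hidden", "output"):   # matrix-like weights
        fan_in = shape[-1]             # fan_in taken as the trailing dim here
        return 1.0 / fan_in
    return 1.0                         # input weights, biases, gains

def apply_lo_update(param, raw_update, kind):
    # raw_update: whatever the learned optimizer emits for this tensor
    return param + mup_update_scale(param.shape, kind) * raw_update

# Example: a 512-wide hidden weight's update is damped by 1/512,
# so doubling the width halves the per-entry update magnitude.
W = np.zeros((512, 512))
W = apply_lo_update(W, np.ones_like(W), kind="hidden")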
