
μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

Main: 9 pages · Appendix: 17 pages · Bibliography: 4 pages · 12 figures · 13 tables
Abstract

Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially lowering training costs. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (μP) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for μ-parameterized LOs (μLOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks compared to LOs trained under standard parametrization (SP), as in existing work. We also empirically observe that μLOs trained with our recipe exhibit unexpectedly improved meta-generalization to deeper networks (5× meta-training depth) and surprising generalization to much longer training horizons (25× the meta-training horizon) compared to SP LOs.
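For readers unfamiliar with μP, the minimal sketch below illustrates the generic width-dependent scaling rules commonly stated for Adam-like updates (per-layer initialization variance and learning-rate scaling as a function of fan-in). It is an assumption-laden illustration of μP itself, not the paper's derivation for the two learned-optimizer architectures, which is specific to those architectures and is not reproduced here; the function name and rule table are hypothetical conveniences.

```python
import numpy as np

def mup_dense_layer(fan_in, fan_out, base_lr, kind="hidden", rng=None):
    """Return (weights, per-layer LR) under generic muP-style rules.

    kind: "input", "hidden", or "output".
    Illustrative sketch only -- assumes the commonly cited muP table for
    Adam-like updates; the paper's muLO derivation is not reproduced here.
    """
    rng = np.random.default_rng() if rng is None else rng
    if kind == "input":
        # Input layer: fan_in is the (fixed) data dimension, so init and LR
        # do not shrink with model width.
        std, lr = 1.0 / np.sqrt(fan_in), base_lr
    elif kind == "hidden":
        # Hidden layers: init variance ~ 1/fan_in, update scale ~ 1/fan_in.
        std, lr = 1.0 / np.sqrt(fan_in), base_lr / fan_in
    else:
        # Output layer: init variance ~ 1/fan_in**2, update scale ~ 1/fan_in.
        std, lr = 1.0 / fan_in, base_lr / fan_in
    weights = rng.normal(0.0, std, size=(fan_in, fan_out))
    return weights, lr

# Example: widening the hidden layer leaves the input layer untouched but
# shrinks the hidden/output update scales, which is what keeps feature
# updates stable across widths under muP.
w_hid, lr_hid = mup_dense_layer(fan_in=1024, fan_out=1024, base_lr=1e-3, kind="hidden")
w_out, lr_out = mup_dense_layer(fan_in=1024, fan_out=10, base_lr=1e-3, kind="output")
print(lr_hid, lr_out)
```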
