The Optimization Landscape of SGD Across the Feature Learning Strength

We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter γ. Recent work has identified γ as controlling the strength of feature learning. As γ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling γ across a variety of models and datasets in the online training setting. We first examine the interaction of γ with the learning rate η, identifying several scaling regimes in the γ-η plane which we explain theoretically using a simple model. We find that the optimal learning rate η* scales non-trivially with γ. In particular, η* ∝ γ² when γ ≪ 1 and η* ∝ γ^{2/L} when γ ≫ 1 for a feed-forward network of depth L. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" γ ≫ 1 regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find that networks of different large γ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large γ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-γ limit may yield useful insights into the dynamics of representation learning in performant models.
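A minimal sketch (not the authors' code) of the setup described above: a depth-L feed-forward network whose readout is down-scaled by a fixed hyperparameter γ, trained with SGD at a learning rate scaled according to the empirical rule stated in the abstract (η* ∝ γ² for γ ≪ 1, η* ∝ γ^{2/L} for γ ≫ 1). The base learning rate `eta0`, the sharp crossover at γ = 1, and all architectural choices are illustrative assumptions.

```python
# Sketch of a gamma-parameterized MLP with regime-dependent learning rate scaling.
import torch
import torch.nn as nn


class GammaScaledMLP(nn.Module):
    """Feed-forward network of depth L whose output is divided by gamma."""

    def __init__(self, d_in: int, width: int, depth: int, gamma: float):
        super().__init__()
        self.gamma = gamma
        layers, d = [], d_in
        for _ in range(depth - 1):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers += [nn.Linear(d, 1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Larger gamma shrinks the initial output, pushing training away from
        # lazy (kernel) dynamics toward rich feature-learning dynamics.
        return self.body(x) / self.gamma


def suggested_lr(eta0: float, gamma: float, depth: int) -> float:
    """Scale a base learning rate eta0 per the abstract's empirical rule
    (illustrative crossover placed at gamma = 1)."""
    return eta0 * gamma**2 if gamma <= 1.0 else eta0 * gamma ** (2.0 / depth)


# Example: one online SGD step on synthetic data at a large ("ultra-rich") gamma.
gamma, depth = 100.0, 4
model = GammaScaledMLP(d_in=10, width=256, depth=depth, gamma=gamma)
opt = torch.optim.SGD(model.parameters(), lr=suggested_lr(0.1, gamma, depth))
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```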
@article{atanasov2025_2410.04642,
  title   = {The Optimization Landscape of SGD Across the Feature Learning Strength},
  author  = {Alexander Atanasov and Alexandru Meterez and James B. Simon and Cengiz Pehlevan},
  journal = {arXiv preprint arXiv:2410.04642},
  year    = {2025}
}