The Optimization Landscape of SGD Across the Feature Learning Strength

We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter γ. Recent work has identified γ as controlling the strength of feature learning. As γ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling γ across a variety of models and datasets in the online training setting. We first examine the interaction of γ with the learning rate η, identifying several scaling regimes in the γ-η plane which we explain theoretically using a simple model. We find that the optimal learning rate η* scales non-trivially with γ. In particular, η* ∝ γ² when γ ≪ 1 and η* ∝ γ^{2/L} when γ ≫ 1 for a feed-forward network of depth L. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" γ ≫ 1 regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find that networks of different large γ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large γ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-γ limit may yield useful insights into the dynamics of representation learning in performant models.
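A minimal sketch (not the authors' code) of the setup described above: a depth-L feed-forward network whose readout is down-scaled by a fixed hyperparameter γ, trained with SGD at a learning rate scaled according to the empirical rule stated in the abstract (η* ∝ γ² for γ ≪ 1, η* ∝ γ^{2/L} for γ ≫ 1). The base learning rate `eta0`, the sharp crossover at γ = 1, and all architectural choices are illustrative assumptions.

```python
# Sketch of a gamma-parameterized MLP with regime-dependent learning rate scaling.
import torch
import torch.nn as nn


class GammaScaledMLP(nn.Module):
    """Feed-forward network of depth L whose output is divided by gamma."""

    def __init__(self, d_in: int, width: int, depth: int, gamma: float):
        super().__init__()
        self.gamma = gamma
        layers, d = [], d_in
        for _ in range(depth - 1):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers += [nn.Linear(d, 1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Larger gamma shrinks the initial output, pushing training away from
        # lazy (kernel) dynamics toward rich feature-learning dynamics.
        return self.body(x) / self.gamma


def suggested_lr(eta0: float, gamma: float, depth: int) -> float:
    """Scale a base learning rate eta0 per the abstract's empirical rule
    (illustrative crossover placed at gamma = 1)."""
    return eta0 * gamma**2 if gamma <= 1.0 else eta0 * gamma ** (2.0 / depth)


# Example: one online SGD step on synthetic data at a large ("ultra-rich") gamma.
gamma, depth = 100.0, 4
model = GammaScaledMLP(d_in=10, width=256, depth=depth, gamma=gamma)
opt = torch.optim.SGD(model.parameters(), lr=suggested_lr(0.1, gamma, depth))
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```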
@article{atanasov2025_2410.04642,
  title   = {The Optimization Landscape of SGD Across the Feature Learning Strength},
  author  = {Alexander Atanasov and Alexandru Meterez and James B. Simon and Cengiz Pehlevan},
  journal = {arXiv preprint arXiv:2410.04642},
  year    = {2025}
}