
The Optimization Landscape of SGD Across the Feature Learning Strength

Abstract

We consider neural networks (NNs) whose final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has identified $\gamma$ as controlling the strength of feature learning. As $\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We first examine the interaction of $\gamma$ with the learning rate $\eta$, identifying several scaling regimes in the $\gamma$-$\eta$ plane which we explain theoretically using a simple model. We find that the optimal learning rate $\eta^*$ scales non-trivially with $\gamma$. In particular, $\eta^* \propto \gamma^2$ when $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ when $\gamma \gg 1$ for a feed-forward network of depth $L$. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" $\gamma \gg 1$ regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find that networks at different large $\gamma$ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large $\gamma$ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
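The following is a minimal sketch, not the authors' code, of the setup described in the abstract: an MLP whose readout is down-scaled by $\gamma$, together with the learning-rate scaling $\eta^* \propto \gamma^2$ for $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ for $\gamma \gg 1$. The class name, base learning rate, widths, and depth are illustrative assumptions.

```python
# Sketch of the gamma-parameterized network and learning-rate scaling from the abstract.
# All hyperparameter values below are assumptions for illustration.
import torch
import torch.nn as nn

class GammaMLP(nn.Module):
    def __init__(self, d_in=784, width=512, depth=3, gamma=1.0):
        super().__init__()
        layers = []
        dims = [d_in] + [width] * (depth - 1)
        for i in range(depth - 1):
            layers += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(width, 1)
        self.gamma = gamma
        self.depth = depth

    def forward(self, x):
        # The final (readout) layer is down-scaled by gamma; large gamma forces the
        # hidden features to move more to fit the targets ("rich" regime).
        return self.head(self.body(x)) / self.gamma

def scaled_lr(eta_base, gamma, depth):
    # Learning-rate scaling taken from the abstract's asymptotics:
    # eta* ∝ gamma^2 for gamma << 1, eta* ∝ gamma^(2/L) for gamma >> 1.
    return eta_base * (gamma ** 2 if gamma < 1.0 else gamma ** (2.0 / depth))

model = GammaMLP(gamma=10.0)  # gamma >> 1 corresponds to the "ultra-rich" regime
opt = torch.optim.SGD(model.parameters(),
                      lr=scaled_lr(1e-1, model.gamma, model.depth))
```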

@article{atanasov2025_2410.04642,
  title={The Optimization Landscape of SGD Across the Feature Learning Strength},
  author={Alexander Atanasov and Alexandru Meterez and James B. Simon and Cengiz Pehlevan},
  journal={arXiv preprint arXiv:2410.04642},
  year={2025}
}