Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

Robert M. Gower
Guillaume Garrigos
Nicolas Loizou
Dimitris Oikonomou
Konstantin Mishchenko
Fabian Schaipp
Abstract

We provide a general convergence theorem for an idealized stochastic Polyak step size called SPS^*. Besides convexity, we only assume a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS^* as idealized because it requires access to the loss for every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz functions, and is the first Polyak step size to have an O(1/\sqrt{t}) anytime convergence rate in the smooth setting. We show how to combine SPS^* with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments to validate our theory, and a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
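To make the method concrete, the sketch below illustrates an idealized stochastic Polyak step on a toy interpolating least-squares problem, where the batch loss at a solution is available exactly (here it is zero by construction). The problem setup, batch size, and iteration count are illustrative choices, not taken from the paper; the step size shown, (f_S(x) - f_S(x_*)) / ||∇f_S(x)||², is one standard form of the stochastic Polyak step and is assumed here to match the spirit of SPS^*.

```python
import numpy as np

# Hypothetical sketch of an idealized stochastic Polyak step (SPS^*-style)
# on a toy least-squares problem. The solution x_star is known, so the
# batch loss at the solution f_S(x_star) is computable exactly.

rng = np.random.default_rng(0)
n, d = 50, 10
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star  # interpolating data: every batch loss vanishes at x_star

def batch_loss_and_grad(x, idx):
    """Mean squared-error loss and gradient over the mini-batch `idx`."""
    r = A[idx] @ x - b[idx]
    loss = 0.5 * np.dot(r, r) / len(idx)
    grad = A[idx].T @ r / len(idx)
    return loss, grad

x = np.zeros(d)
for t in range(2000):
    idx = rng.choice(n, size=5, replace=False)
    loss, grad = batch_loss_and_grad(x, idx)
    loss_at_star, _ = batch_loss_and_grad(x_star, idx)  # = 0 here
    g2 = np.dot(grad, grad)
    if g2 > 0:
        # Polyak step: (f_S(x_t) - f_S(x_*)) / ||grad f_S(x_t)||^2
        step = (loss - loss_at_star) / g2
        x = x - step * grad

print("final error:", np.linalg.norm(x - x_star))
```

Note that no step-size tuning is needed: the step is computed from the current batch loss, its gradient, and the loss at the solution, which is what makes the method "idealized" (that last quantity is normally unknown) and also what enables the tuning-free distillation setting, where the teacher's loss plays that role.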

@article{gower2025_2504.01898,
  title={Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation},
  author={Robert M. Gower and Guillaume Garrigos and Nicolas Loizou and Dimitris Oikonomou and Konstantin Mishchenko and Fabian Schaipp},
  journal={arXiv preprint arXiv:2504.01898},
  year={2025}
}