Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

We provide a general convergence theorem for an idealized stochastic Polyak step size, which we call SPS. Besides convexity, we assume only a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS as idealized because it requires access to the loss of every training batch evaluated at a solution. It is also ideal in that it achieves the optimal lower bound for globally Lipschitz functions, and it is the first Polyak step size with an anytime convergence rate in the smooth setting. We show how to combine SPS with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments validating our theory, and with a more practical setting in which we distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
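A stochastic Polyak step size sets the learning rate from the gap between the current batch loss and the batch loss at a solution; the abstract calls the method idealized because that solution-level loss is assumed to be available. The exact variant analyzed in the paper is not reproduced here, but a minimal sketch of a stochastic Polyak update on an interpolating least-squares problem (where the batch loss at the solution is exactly zero, so the idealized oracle is trivially available) might look like the following. All names and problem sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic interpolating least-squares problem: A @ x_star == b exactly,
# so the loss of every batch at the solution is zero.
n, d = 200, 20
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star

def batch_loss_and_grad(x, idx):
    """Mean-squared loss and its gradient on the batch given by idx."""
    r = A[idx] @ x - b[idx]
    return 0.5 * np.mean(r**2), A[idx].T @ r / len(idx)

x = np.zeros(d)
batch_size = 10
for t in range(1000):
    idx = rng.choice(n, size=batch_size, replace=False)
    loss, grad = batch_loss_and_grad(x, idx)
    loss_at_solution = 0.0  # the "idealized" oracle; trivial under interpolation
    g2 = grad @ grad
    if g2 > 1e-12:
        # Polyak step size: loss gap divided by squared gradient norm.
        step = (loss - loss_at_solution) / g2
        x = x - step * grad

print(np.linalg.norm(x - x_star))
```

Note that no step-size schedule is tuned anywhere: the loss gap itself determines the step, which is what makes the method attractive for settings like distillation where hyperparameter sweeps are costly.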
@article{gower2025_2504.01898,
  title   = {Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation},
  author  = {Robert M. Gower and Guillaume Garrigos and Nicolas Loizou and Dimitris Oikonomou and Konstantin Mishchenko and Fabian Schaipp},
  journal = {arXiv preprint arXiv:2504.01898},
  year    = {2025}
}