
Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes

Abstract

We analyze the generalization gap (the gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $\theta_0 \sim p_0$. We focus on Langevin dynamics with a positive temperature $\beta^{-1}$, i.e. gradient descent on a training loss $L$ with infinitesimal step size, perturbed with $\beta^{-1}$-variance Gaussian noise, and lightly regularized or bounded. In this setting, we bound the generalization gap, at any time during training, by $\sqrt{(\beta \mathbb{E} L(\theta_0) + \log(1/\delta))/N}$ with probability $1-\delta$ over the dataset, where $N$ is the sample size and $\mathbb{E} L(\theta_0) = O(1)$ under standard initialization scaling. In contrast to previous guarantees, our bound does not depend on training time or rely on mixing, nor does it depend on dimensionality, gradient norms, or any other properties of the loss or model. This guarantee follows from a general analysis of any Markov process-based training that has a Gibbs-style stationary distribution. The proof is surprisingly simple, once we observe that the divergence of the marginal distribution from initialization remains bounded, as implied by a generalized second law of thermodynamics.
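To make the dynamics and the bound concrete, below is a minimal sketch in Python of a discretized (Euler–Maruyama) version of the process described above, together with the stated bound. The toy quadratic loss, the step size eta, the hyperparameter values, and the helper names (toy_loss_grad, langevin_step, generalization_bound) are illustrative assumptions, not the paper's construction; the paper itself works in continuous time with a light regularizer or bounded domain, which this sketch omits.

import numpy as np

# Minimal sketch (assumed setup, not the paper's exact one): discretized
# Langevin dynamics on a toy quadratic loss. The continuous-time dynamics are
#   d(theta) = -grad L(theta) dt + sqrt(2 / beta) dW,
# whose stationary distribution is the Gibbs distribution ~ exp(-beta * L).

rng = np.random.default_rng(0)

def toy_loss_grad(theta, X, y):
    # Gradient of the illustrative loss L(theta) = mean((X @ theta - y)**2) / 2.
    return X.T @ (X @ theta - y) / len(y)

def langevin_step(theta, X, y, eta=1e-3, beta=1e2):
    # One Euler-Maruyama step: a gradient step plus Gaussian noise whose
    # variance is proportional to the temperature beta^{-1}.
    noise = rng.standard_normal(theta.shape)
    return theta - eta * toy_loss_grad(theta, X, y) + np.sqrt(2 * eta / beta) * noise

def generalization_bound(beta, expected_init_loss, delta, N):
    # The abstract's bound on the generalization gap (up to constant factors):
    # sqrt((beta * E[L(theta_0)] + log(1/delta)) / N), holding w.p. 1 - delta.
    return np.sqrt((beta * expected_init_loss + np.log(1.0 / delta)) / N)

# Illustrative run: N = 1000 samples, d = 50 parameters.
N, d = 1000, 50
X = rng.standard_normal((N, d)) / np.sqrt(d)
y = rng.standard_normal(N)
theta = rng.standard_normal(d) / np.sqrt(d)  # standard initialization scaling

for _ in range(500):
    theta = langevin_step(theta, X, y)

# E[L(theta_0)] = O(1) under standard initialization scaling (per the abstract);
# 1.0 below is a placeholder value for it.
print("generalization gap bound:",
      generalization_bound(beta=1e2, expected_init_loss=1.0, delta=0.05, N=N))

Note that, per the abstract, the bound depends only on $\beta$, the expected loss at initialization, the confidence level $\delta$, and the sample size $N$; it holds at any time during training, so the loop above can run for any number of steps without changing the bound.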

@article{harel2025_2505.19087,
  title={Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes},
  author={Itamar Harel and Yonathan Wolanowsky and Gal Vardi and Nathan Srebro and Daniel Soudry},
  journal={arXiv preprint arXiv:2505.19087},
  year={2025}
}