
(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum

Abstract

Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent. By primarily focusing on strongly-convex quadratics, we aim to better understand the theoretical advantage of SHB and subsequently improve the method. For strongly-convex quadratics, Kidambi et al. (2018) show that SHB (with a mini-batch of size 1) cannot attain accelerated convergence, and hence has no theoretical benefit over SGD. They conjecture that the practical gain of SHB is a by-product of using larger mini-batches. We first substantiate this claim by showing that SHB can attain an accelerated rate when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$. Specifically, we prove that with the same step-size and momentum parameters as in the deterministic setting, SHB with a sufficiently large mini-batch size results in an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}}\right) + \sigma\right)$ convergence rate, where $T$ is the number of iterations and $\sigma^2$ is the variance in the stochastic gradients. We prove a lower bound which demonstrates that a $\kappa$ dependence in $b^*$ is necessary. To ensure convergence to the minimizer, we design a noise-adaptive multi-stage algorithm that results in an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}}\right) + \frac{\sigma}{T}\right)$ rate. We also consider the general smooth, strongly-convex setting and propose the first noise-adaptive SHB variant that converges to the minimizer at an $O\left(\exp\left(-\frac{T}{\kappa}\right) + \frac{\sigma^2}{T}\right)$ rate. We empirically demonstrate the effectiveness of the proposed algorithms.
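For concreteness, below is a minimal sketch of the mini-batch SHB iteration the abstract refers to, written in Python. It is illustrative only: the function names, the oracle signature `grad_fn(x, batch_size, rng)`, and the default parameter choices are assumptions, not the paper's implementation. The step-size and momentum values shown in the comments are the standard deterministic heavy-ball choices for an $L$-smooth, $\mu$-strongly-convex quadratic, which the abstract says are reused in the stochastic setting when the mini-batch size is large enough.

```python
import numpy as np


def shb_minibatch(grad_fn, x0, step_size, momentum, batch_size, num_iters, rng):
    """Sketch of stochastic heavy-ball (SHB) momentum with mini-batches.

    grad_fn(x, batch_size, rng) is assumed to return an unbiased mini-batch
    gradient estimate at x. For a strongly-convex quadratic with smoothness L
    and strong-convexity mu (condition number kappa = L / mu), the standard
    deterministic heavy-ball parameters are
        step_size = 4 / (sqrt(L) + sqrt(mu)) ** 2
        momentum  = ((sqrt(kappa) - 1) / (sqrt(kappa) + 1)) ** 2
    """
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(num_iters):
        g = grad_fn(x, batch_size, rng)
        # Heavy-ball update: gradient step plus momentum from the previous iterate.
        x_next = x - step_size * g + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x
```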
