(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum

Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent. By primarily focusing on strongly-convex quadratics, we aim to better understand the theoretical advantage of SHB and subsequently improve the method. For strongly-convex quadratics, Kidambi et al. (2018) show that SHB (with a mini-batch of size 1) cannot attain accelerated convergence, and hence has no theoretical benefit over SGD. They conjecture that the practical gain of SHB is a by-product of using larger mini-batches. We first substantiate this claim by showing that SHB can attain an accelerated rate when the mini-batch size is larger than a threshold b* that depends on the condition number κ. Specifically, we prove that with the same step-size and momentum parameters as in the deterministic setting, SHB with a sufficiently large mini-batch size results in an O(exp(-T/√κ) + σ) convergence, where T is the number of iterations and σ² is the variance in the stochastic gradients. We prove a lower bound which demonstrates that a dependence of b* on κ is necessary. To ensure convergence to the minimizer, we design a noise-adaptive multi-stage algorithm that results in an O(exp(-T/√κ) + σ/T) rate. We also consider the general smooth, strongly-convex setting and propose the first noise-adaptive SHB variant that converges to the minimizer at an O(exp(-T/κ) + σ²/T) rate. We empirically demonstrate the effectiveness of the proposed algorithms.
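For concreteness, below is a minimal Python/NumPy sketch of mini-batch SHB on a strongly-convex quadratic, using the deterministic step-size and momentum parameters referred to above. The least-squares setup, the function name shb_quadratic, and all argument names are illustrative assumptions for this sketch, not the paper's exact algorithm or notation.

```python
import numpy as np

def shb_quadratic(A, b, batch_size, T, seed=None):
    """Sketch of mini-batch stochastic heavy ball on f(x) = (1/2n) ||A x - b||^2.

    Uses the deterministic heavy-ball step-size and momentum computed from the
    smoothness (L) and strong-convexity (mu) constants of the full objective.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape

    # Hessian of the full objective and its extreme eigenvalues.
    H = A.T @ A / n
    eigvals = np.linalg.eigvalsh(H)
    mu, L = eigvals[0], eigvals[-1]
    kappa = L / mu

    # Deterministic heavy-ball parameters: step-size alpha and momentum beta.
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2

    x_prev = np.zeros(d)
    x = np.zeros(d)
    for _ in range(T):
        # Mini-batch gradient of the least-squares objective.
        idx = rng.choice(n, size=batch_size, replace=False)
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size
        # Heavy-ball update: gradient step plus momentum term.
        x, x_prev = x - alpha * grad + beta * (x - x_prev), x
    return x
```

In this sketch, the accelerated regime discussed in the abstract corresponds to choosing batch_size above a κ-dependent threshold; with batch_size equal to n, the iteration reduces to deterministic heavy ball.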