Momentum Does Not Reduce Stochastic Noise in Stochastic Gradient Descent

4 February 2024
Naoki Sato
Hideaki Iiduka
Main: 10 pages · 10 figures · 3 tables · Bibliography: 4 pages · Appendix: 11 pages
Abstract

For nonconvex objective functions, including those found in training deep neural networks, stochastic gradient descent (SGD) with momentum is said to converge faster and generalize better than SGD without momentum. In particular, adding momentum is thought to reduce stochastic noise. To verify this, we estimated the magnitude of gradient noise by using a convergence analysis and an optimal batch-size estimation formula and found that momentum does not reduce gradient noise. We also analyzed the effect of search direction noise, which is stochastic noise defined as the error between the search direction of the optimizer and the steepest descent direction, and found that it inherently smooths the objective function and that momentum does not reduce search direction noise either. Finally, an analysis of the degree of smoothing introduced by search direction noise revealed that adding momentum offers only a limited advantage to SGD.
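For context, a minimal sketch of the quantities the abstract refers to, assuming the standard heavy-ball formulation of SGD with momentum (the paper's own notation and constants may differ):

% Heavy-ball SGD with momentum on a mini-batch B_t (standard form; an assumption,
% not necessarily the paper's exact parameterization):
\begin{align*}
  m_{t+1} &= \beta m_t + \nabla f_{B_t}(x_t), & &\text{momentum buffer, } \beta \in [0,1), \\
  x_{t+1} &= x_t - \eta\, m_{t+1}, & &\text{parameter update with learning rate } \eta.
\end{align*}
% Search direction noise, as described in the abstract: the error between the
% optimizer's search direction d_t (here d_t = m_{t+1}; for plain SGD,
% d_t = \nabla f_{B_t}(x_t)) and the steepest descent direction \nabla f(x_t):
\[
  \omega_t := d_t - \nabla f(x_t).
\]

Setting \beta = 0 recovers SGD without momentum, so the paper's claim can be read as: the magnitude of \omega_t (and of the gradient noise \nabla f_{B_t}(x_t) - \nabla f(x_t)) is not reduced by choosing \beta > 0.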
