Revisiting Step-Size Assumptions in Stochastic Approximation

Abstract

Many machine learning and optimization algorithms are built upon the framework of stochastic approximation (SA), for which the selection of step-size (or learning rate) is essential for success. For the sake of clarity, this paper focuses on the special case $\alpha_n = \alpha_0 n^{-\rho}$ at iteration $n$, with $\rho \in [0,1]$ and $\alpha_0 > 0$ design parameters. It is most common in practice to take $\rho = 0$ (constant step-size), while in more theoretically oriented papers a vanishing step-size is preferred. In particular, with $\rho \in (1/2, 1)$ it is known that on applying the averaging technique of Polyak and Ruppert, the mean-squared error (MSE) converges at the optimal rate of $O(1/n)$ and the covariance in the central limit theorem (CLT) is minimal in a precise sense. The paper revisits step-size selection in a general Markovian setting. Under readily verifiable assumptions, the following conclusions are obtained provided $0 < \rho < 1$:

$\bullet$ Parameter estimates converge with probability one, and also in $L_p$ for any $p \ge 1$.

$\bullet$ The MSE may converge very slowly for small $\rho$, of order $O(\alpha_n^2)$ even with averaging.

$\bullet$ For linear stochastic approximation the source of slow convergence is identified: for any $\rho \in (0,1)$, averaging results in estimates for which the error \textit{covariance} vanishes at the optimal rate, and moreover the CLT covariance is optimal in the sense of Polyak and Ruppert. However, necessary and sufficient conditions are obtained under which the \textit{bias} converges to zero at rate $O(\alpha_n)$.

This is the first paper to obtain such strong conclusions while allowing for $\rho \le 1/2$. A major conclusion is that the choice of $\rho = 0$ or even $\rho < 1/2$ is justified only in select settings: in general, bias may preclude fast convergence.
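The following is a minimal illustrative sketch (not code from the paper) of the setup the abstract describes: a linear SA recursion driven by additive noise, run with the polynomial step-size $\alpha_n = \alpha_0 n^{-\rho}$ and Polyak-Ruppert averaging of the iterates. The matrix $A$, vector $b$, noise model, and all parameter values below are assumptions chosen only for demonstration.

```python
# Illustrative linear stochastic approximation sketch (assumed problem data):
#   theta_{n+1} = theta_n + alpha_n * (b - A @ theta_n + noise_n),
# with step-size alpha_n = alpha_0 * n**(-rho) and Polyak-Ruppert averaging.
import numpy as np

def linear_sa(A, b, theta0, alpha0=0.5, rho=0.8, n_iters=10_000,
              noise_std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    theta_bar = np.zeros_like(theta)          # running Polyak-Ruppert average
    for n in range(1, n_iters + 1):
        alpha_n = alpha0 * n ** (-rho)        # polynomial step-size
        noise = noise_std * rng.standard_normal(theta.shape)
        theta = theta + alpha_n * (b - A @ theta + noise)   # SA update
        theta_bar += (theta - theta_bar) / n  # incremental average of iterates
    return theta, theta_bar

# Toy 2-d example (assumed data): the recursion targets theta* solving A theta* = b.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
theta_last, theta_avg = linear_sa(A, b, theta0=[0.0, 0.0])
print("last iterate:", theta_last, " averaged iterate:", theta_avg)
```

Comparing the last iterate with the averaged iterate for different values of $\rho$ gives a rough sense of the MSE and bias behavior discussed above, though the paper's results concern the general Markovian setting rather than this simplified additive-noise case.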
