
The case for and against fixed step-size: Stochastic approximation algorithms in optimization and machine learning

Main: 17 pages
9 figures
Bibliography: 2 pages
Appendix: 18 pages
Abstract

Theory and application of stochastic approximation (SA) have become increasingly relevant, due in part to applications in optimization and reinforcement learning. This paper takes a new look at SA with constant step-size $\alpha>0$, defined by the recursion
$$\theta_{n+1} = \theta_n + \alpha f(\theta_n, \Phi_{n+1}),$$
in which $\theta_n \in \mathbb{R}^d$ and $\{\Phi_n\}$ is a Markov chain. The goal is to approximately solve the root-finding problem $\bar{f}(\theta^*) = 0$, where $\bar{f}(\theta) = \mathbb{E}[f(\theta, \Phi)]$ and $\Phi$ has the steady-state distribution of $\{\Phi_n\}$. The following conclusions are obtained under an ergodicity assumption on the Markov chain, compatible assumptions on $f$, and for $\alpha>0$ sufficiently small:

1. The pair process $\{(\theta_n, \Phi_n)\}$ is geometrically ergodic in a topological sense.

2. For every $1 \le p \le 4$, there is a constant $b_p$ such that $\limsup_{n\to\infty} \mathbb{E}[\|\theta_n - \theta^*\|^p] \le b_p \alpha^{p/2}$ for each initial condition.

3. The Polyak-Ruppert-style averaged estimates $\theta^{\text{PR}}_n = n^{-1} \sum_{k=1}^{n} \theta_k$ converge almost surely and in mean square to a limit $\theta^{\text{PR}}_\infty$, which satisfies $\theta^{\text{PR}}_\infty = \theta^* + \alpha \bar{\Upsilon}^* + O(\alpha^2)$ for an identified non-random $\bar{\Upsilon}^* \in \mathbb{R}^d$. Moreover, the covariance is approximately optimal: the limiting covariance matrix of $\theta^{\text{PR}}_n$ is approximately minimal in a matricial sense.

The two main take-aways for practitioners are application-dependent. It is argued that, in applications to optimization, constant-gain algorithms may be preferable even when the objective has multiple local minima, while a vanishing-gain algorithm is preferable in applications to reinforcement learning due to the presence of bias.
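To make the recursion and the averaging step concrete, below is a minimal sketch (not taken from the paper) of constant step-size SA with Polyak-Ruppert averaging. The two-state Markov chain, the choice $f(\theta, \phi) = \phi - \theta$, and all numerical values are illustrative assumptions, chosen so that $\bar{f}(\theta) = \mathbb{E}[\Phi] - \theta$ has the known root $\theta^* = \mathbb{E}[\Phi]$.

```python
# Sketch: constant step-size stochastic approximation driven by a
# Markov chain, with Polyak-Ruppert (PR) averaging. All model choices
# here are hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Two-state Markov chain on {0.0, 2.0} with a symmetric transition
# matrix; its stationary distribution is uniform, so theta* = E[Phi] = 1.
states = np.array([0.0, 2.0])
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])

def f(theta, phi):
    # f(theta, phi) = phi - theta, so bar{f}(theta) = E[Phi] - theta,
    # with root theta* = E[Phi].
    return phi - theta

alpha = 0.05          # constant step-size
n_iters = 200_000
theta = 0.0           # initial condition theta_0
state = 0             # initial state of the chain
theta_sum = 0.0

for n in range(n_iters):
    state = rng.choice(2, p=P[state])               # Phi_{n+1}
    theta += alpha * f(theta, states[state])        # theta_{n+1} = theta_n + alpha f(theta_n, Phi_{n+1})
    theta_sum += theta

theta_pr = theta_sum / n_iters                      # PR average over the run
print(f"final iterate theta_n : {theta:.4f}")
print(f"PR average theta^PR_n : {theta_pr:.4f}  (theta* = 1.0)")
```

Consistent with the abstract's conclusions, the raw iterate fluctuates around $\theta^*$ at scale $O(\sqrt{\alpha})$, while the PR average concentrates much more tightly; shrinking `alpha` in this sketch reduces the residual offset of the average, in line with the stated $\theta^{\text{PR}}_\infty = \theta^* + \alpha \bar{\Upsilon}^* + O(\alpha^2)$ expansion.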
