
The case for and against fixed step-size: Stochastic approximation algorithms in optimization and machine learning

Main: 17 pages
9 figures
Bibliography: 2 pages
Appendix: 18 pages
Abstract

Theory and application of stochastic approximation (SA) have become increasingly relevant, due in part to applications in optimization and reinforcement learning. This paper takes a new look at SA with constant step-size $\alpha>0$, defined by the recursion
$$\theta_{n+1} = \theta_n + \alpha f(\theta_n, \Phi_{n+1}),$$
in which $\theta_n \in \mathbb{R}^d$ and $\{\Phi_n\}$ is a Markov chain. The goal is to approximately solve the root-finding problem $\bar{f}(\theta^*) = 0$, where $\bar{f}(\theta) = \mathbb{E}[f(\theta, \Phi)]$ and $\Phi$ has the steady-state distribution of $\{\Phi_n\}$. The following conclusions are obtained under an ergodicity assumption on the Markov chain, compatible assumptions on $f$, and for $\alpha>0$ sufficiently small:

1. The pair process $\{(\theta_n, \Phi_n)\}$ is geometrically ergodic in a topological sense.

2. For every $1 \le p \le 4$, there is a constant $b_p$ such that $\limsup_{n\to\infty} \mathbb{E}[\|\theta_n - \theta^*\|^p] \le b_p \alpha^{p/2}$ for each initial condition.

3. The Polyak-Ruppert-style averaged estimates $\theta^{\text{PR}}_n = n^{-1} \sum_{k=1}^{n} \theta_k$ converge almost surely and in mean square to a limit $\theta^{\text{PR}}_\infty$, which satisfies $\theta^{\text{PR}}_\infty = \theta^* + \alpha \bar{\Upsilon}^* + O(\alpha^2)$ for an identified non-random $\bar{\Upsilon}^* \in \mathbb{R}^d$. Moreover, the covariance is approximately optimal: the limiting covariance matrix of $\theta^{\text{PR}}_n$ is approximately minimal in a matricial sense.

The two main take-aways for practitioners are application-dependent. It is argued that, in applications to optimization, constant-gain algorithms may be preferable even when the objective has multiple local minima, while a vanishing-gain algorithm is preferable in applications to reinforcement learning due to the presence of bias.
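To make the recursion and the averaging step concrete, below is a minimal sketch (not taken from the paper) of constant step-size SA with Polyak-Ruppert averaging. The two-state Markov chain, the choice $f(\theta, \phi) = \phi - \theta$, and all numerical values are illustrative assumptions, chosen so that $\bar{f}(\theta) = \mathbb{E}[\Phi] - \theta$ has the known root $\theta^* = \mathbb{E}[\Phi]$.

```python
# Sketch: constant step-size stochastic approximation driven by a
# Markov chain, with Polyak-Ruppert (PR) averaging. All model choices
# here are hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Two-state Markov chain on {0.0, 2.0} with a symmetric transition
# matrix; its stationary distribution is uniform, so theta* = E[Phi] = 1.
states = np.array([0.0, 2.0])
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])

def f(theta, phi):
    # f(theta, phi) = phi - theta, so bar{f}(theta) = E[Phi] - theta,
    # with root theta* = E[Phi].
    return phi - theta

alpha = 0.05          # constant step-size
n_iters = 200_000
theta = 0.0           # initial condition theta_0
state = 0             # initial state of the chain
theta_sum = 0.0

for n in range(n_iters):
    state = rng.choice(2, p=P[state])               # Phi_{n+1}
    theta += alpha * f(theta, states[state])        # theta_{n+1} = theta_n + alpha f(theta_n, Phi_{n+1})
    theta_sum += theta

theta_pr = theta_sum / n_iters                      # PR average over the run
print(f"final iterate theta_n : {theta:.4f}")
print(f"PR average theta^PR_n : {theta_pr:.4f}  (theta* = 1.0)")
```

Consistent with the abstract's conclusions, the raw iterate fluctuates around $\theta^*$ at scale $O(\sqrt{\alpha})$, while the PR average concentrates much more tightly; shrinking `alpha` in this sketch reduces the residual offset of the average, in line with the stated $\theta^{\text{PR}}_\infty = \theta^* + \alpha \bar{\Upsilon}^* + O(\alpha^2)$ expansion.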
