Almost Sure Convergence of Average Reward Temporal Difference Learning
Main: 6 pages · Bibliography: 3 pages · Appendix: 19 pages · 1 table
Abstract
Tabular average reward Temporal Difference (TD) learning is perhaps the simplest and most fundamental policy evaluation algorithm in average reward reinforcement learning. More than 25 years after its discovery, we finally provide a long-awaited almost sure convergence analysis. Namely, we are the first to prove that, under very mild conditions, tabular average reward TD converges almost surely to a sample-path-dependent fixed point. Key to this success is a new general stochastic approximation result concerning nonexpansive mappings with Markovian and additive noise, built on recent advances in stochastic Krasnoselskii-Mann iterations.
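For context, below is a minimal Python sketch of the standard tabular average-reward TD(0) update, the algorithm whose convergence is analyzed. The toy Markov reward process, the constant step sizes `alpha` and `beta`, and the delta-based reward-rate update are illustrative assumptions, not the paper's code.

```python
import numpy as np

def average_reward_td(P, R, steps=200_000, alpha=0.01, beta=0.01, seed=0):
    """Minimal tabular average-reward TD(0) sketch (illustrative, not the paper's code).

    P[s] is the next-state distribution under the evaluated policy,
    R[s, s'] the reward for the transition s -> s'. V holds the
    differential value estimates and r_bar the average-reward estimate.
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    V = np.zeros(n)   # differential (relative) value estimates
    r_bar = 0.0       # running estimate of the reward rate
    s = 0
    for _ in range(steps):
        s_next = rng.choice(n, p=P[s])
        r = R[s, s_next]
        # Average-reward TD error: no discounting; the reward is
        # centered by the current average-reward estimate.
        delta = r - r_bar + V[s_next] - V[s]
        V[s] += alpha * delta
        r_bar += beta * delta   # one common reward-rate update (assumed variant)
        s = s_next
    return V, r_bar

# Toy two-state Markov reward process for illustration.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, r_bar = average_reward_td(P, R)
```

Note that the fixed points of this update form a set (differential values are determined only up to a constant offset), which is why the paper's result concerns convergence to a sample-path-dependent fixed point.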
