Two Timescale Stochastic Approximation with Controlled Markov Noise and Off-Policy Temporal Difference Learning

31 March 2015
Prasenjit Karmakar
S. Bhatnagar
arXiv:1503.09105
Abstract

We present, for the first time, an asymptotic convergence analysis of two-timescale stochastic approximation driven by controlled Markov noise. In particular, both the faster and slower recursions have non-additive Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions on both timescales, defined in terms of the invariant probability measures associated with the controlled Markov processes. Finally, we show how our results solve the off-policy convergence problem for temporal difference learning with linear function approximation, and we prove stability of the iterates in this case. More generally, we emphasize that any reinforcement learning scenario in which the value function is approximated needs to account for Markov noise in its convergence proofs.
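The abstract describes coupled stochastic approximation recursions on two timescales, each perturbed by both controlled Markov noise and martingale difference noise. As a point of reference, a sketch of that setup in standard notation (ours, not quoted from the paper) looks as follows, with the step-size ratio separating the fast iterate x_n from the slow iterate y_n:

```latex
% Two-timescale stochastic approximation with controlled Markov noise
% (standard form of the setup the abstract describes; notation is illustrative).
\begin{aligned}
x_{n+1} &= x_n + a(n)\,\bigl[ h\bigl(x_n, y_n, Z^{(1)}_n\bigr) + M^{(1)}_{n+1} \bigr],\\
y_{n+1} &= y_n + b(n)\,\bigl[ g\bigl(x_n, y_n, Z^{(2)}_n\bigr) + M^{(2)}_{n+1} \bigr],
\end{aligned}
\qquad \text{with } \frac{b(n)}{a(n)} \to 0,
```

where the Z^(i)_n are the controlled Markov (non-additive) noise processes, the M^(i)_{n+1} are martingale difference sequences, and b(n)/a(n) → 0 makes the y-recursion the slower one.

To make the off-policy temporal difference connection concrete, below is a minimal, self-contained Python sketch of a TDC-style two-timescale off-policy TD(0) learner with linear function approximation on a toy chain MDP. The environment, feature map, policies, and step sizes are illustrative assumptions rather than details from the paper, and the updates follow one common variant of TDC (Sutton et al., 2009); convergence of exactly this kind of coupled iteration under Markovian, off-policy sampling is the question the paper's results address.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 5-state chain MDP: actions 0 = left, 1 = right; an episode ends when
# the agent steps off either end, with reward 1 only for exiting to the right.
n_states, gamma = 5, 0.9
phi = np.eye(n_states)        # one-hot features (linear function approximation)

def step(s, a):
    s2 = s + (1 if a == 1 else -1)
    if s2 < 0:
        return None, 0.0      # fell off the left end
    if s2 >= n_states:
        return None, 1.0      # exited to the right: reward 1
    return s2, 0.0

mu = np.array([0.5, 0.5])     # behavior policy: uniform random
pi = np.array([0.0, 1.0])     # target policy: always move right

theta = np.zeros(n_states)    # slower iterate: value-function weights
w = np.zeros(n_states)        # faster iterate: auxiliary correction weights
alpha, beta = 0.01, 0.1       # beta >> alpha separates the two timescales

for episode in range(5000):
    s = n_states // 2
    while s is not None:
        a = rng.choice(2, p=mu)
        rho = pi[a] / mu[a]   # importance-sampling ratio pi(a|s) / mu(a|s)
        s2, r = step(s, a)
        phi_t = phi[s]
        phi_tp1 = phi[s2] if s2 is not None else np.zeros(n_states)
        delta = r + gamma * phi_tp1 @ theta - phi_t @ theta  # TD error
        # TDC-style coupled updates (one common variant; details vary by paper)
        theta += alpha * rho * (delta * phi_t - gamma * phi_tp1 * (phi_t @ w))
        w += beta * (rho * delta - phi_t @ w) * phi_t
        s = s2

# Under the target policy (always right), V(s) = gamma ** (n_states - 1 - s).
print("learned values:", np.round(phi @ theta, 3))
```

Note that successive samples come from the behavior policy's Markov chain rather than from i.i.d. draws, which is precisely why Markov noise, and not just martingale difference noise, must appear in the convergence analysis.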
