
Critic-Actor for Average Reward MDPs with Function Approximation: A Finite-Time Analysis

Main: 7 pages · Bibliography: 1 page · Appendix: 36 pages · 2 figures · 10 tables
Abstract

In recent years, considerable research activity has focused on asymptotic and non-asymptotic convergence analyses of two-timescale actor-critic algorithms, in which the actor is updated on a slower timescale than the critic. A recent work introduced the critic-actor algorithm for the infinite-horizon discounted cost setting in the look-up table case, where the timescales of the actor and the critic are reversed, and provided an asymptotic convergence analysis. In our work, we present the first critic-actor algorithm with function approximation in the long-run average reward setting, together with the first finite-time (non-asymptotic) analysis of such a scheme. We obtain optimal learning rates and prove that our algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2.08})$ for the mean squared error of the critic to be upper bounded by $\epsilon$, which improves on the bound obtained for actor-critic in a similar setting. We also report numerical experiments on three benchmark settings and observe that the critic-actor algorithm competes well with the actor-critic algorithm.
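To illustrate the timescale reversal described above, here is a minimal sketch of a critic-actor loop on a toy 2-state, 2-action average-reward MDP. All details (the transition matrix, rewards, feature map, and step-size exponents) are illustrative assumptions, not the paper's construction; the one idea carried over from the abstract is that the actor uses the faster step size and the critic the slower one, the reverse of standard actor-critic.

```python
import numpy as np

# Hedged sketch of a critic-actor update loop (NOT the paper's algorithm):
# the actor runs on the FASTER timescale, the critic on the SLOWER one.
rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
# Toy transition kernel P[s, a, s'] and reward table r[s, a] (assumed values).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

phi = np.eye(n_states)                    # trivial linear critic features
w = np.zeros(n_states)                    # critic weights (differential values)
theta = np.zeros((n_states, n_actions))   # softmax policy parameters
avg_r = 0.0                               # running average-reward estimate

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for t in range(1, 20001):
    alpha = 1.0 / t**0.6                  # actor step size (faster timescale)
    beta = 1.0 / t**0.9                   # critic step size (slower timescale)
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P[s, a])
    # Average-reward TD error: delta = r - avg_r + V(s') - V(s)
    delta = r[s, a] - avg_r + phi[s_next] @ w - phi[s] @ w
    avg_r += beta * (r[s, a] - avg_r)     # slow timescale
    w += beta * delta * phi[s]            # critic update: slow timescale
    grad_log = -pi
    grad_log[a] += 1.0                    # gradient of log pi(a|s) wrt theta[s]
    theta[s] += alpha * delta * grad_log  # actor update: fast timescale
    s = s_next

print(round(avg_r, 3))
```

Since rewards lie in [0, 2], the average-reward estimate settles somewhere in that interval; the sketch only shows the coupling of the two step-size schedules, not the convergence guarantees proved in the paper.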
