v2 (latest)

Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs

Main: 8 pages
Bibliography: 3 pages
Tables: 2
Appendix: 24 pages
Abstract

We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average reward Markov Decision Processes (MDPs). The first employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of order $\tilde{\mathcal{O}}(T^{2/3})$. The second, rooted in Hessian-based techniques, ensures an expected regret of order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve on the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and match the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a result that earlier works had only assumed.
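To illustrate the first technique named in the abstract, below is a minimal, hypothetical sketch of a policy-gradient update with Implicit Gradient Transport (IGT) momentum for variance reduction. The objective, sampling routine, step sizes, and function names (`sample_pg_estimate`, `igt_policy_gradient`) are illustrative assumptions, not the paper's construction or guarantees.

```python
# Hypothetical sketch: variance-reduced policy gradient via implicit
# gradient transport (IGT). A toy noisy-quadratic objective stands in for
# a REINFORCE-style estimate from sampled trajectories, so the script
# stays self-contained and runnable.
import numpy as np

rng = np.random.default_rng(0)

def sample_pg_estimate(theta, noise_scale=0.5):
    """Noisy gradient estimate at theta (stand-in for a trajectory-based
    policy-gradient estimate); gradient of -0.5 * ||theta||^2 plus noise."""
    true_grad = -theta
    return true_grad + noise_scale * rng.standard_normal(theta.shape)

def igt_policy_gradient(theta0, num_iters=200, lr=0.1):
    """Gradient ascent with IGT momentum: past estimates are implicitly
    'transported' to the current iterate by querying the stochastic
    gradient at an extrapolated point."""
    theta_prev = theta0.copy()
    theta = theta0.copy()
    g = np.zeros_like(theta0)          # running transported gradient estimate
    for t in range(1, num_iters + 1):
        gamma = 1.0 / t                # IGT averaging weight
        shifted = theta + ((1.0 - gamma) / gamma) * (theta - theta_prev)
        g = (1.0 - gamma) * g + gamma * sample_pg_estimate(shifted)
        theta_prev = theta.copy()
        theta = theta + lr * g         # ascent step on the surrogate objective
    return theta

print(igt_policy_gradient(np.ones(4)))
```

With the 1/t averaging weight, the momentum term averages all past (transported) estimates, which is what drives the variance reduction behind the improved regret order.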
