
Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning

Comments: 7 pages (main text), 3 figures, 2 pages (bibliography), 29 pages (appendix)
Abstract

We study infinite-horizon average-reward reinforcement learning (RL) for continuous-space Lipschitz MDPs in which an agent can play policies from a given set $\Phi$. The proposed algorithms efficiently explore the policy space by "zooming" into the "promising regions" of $\Phi$, thereby achieving adaptivity gains in performance. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = d^\Phi_z + 2$ for the model-free algorithm $\textit{PZRL-MF}$ and $d_{\text{eff.}} = 2d_\mathcal{S} + d^\Phi_z + 3$ for the model-based algorithm $\textit{PZRL-MB}$. Here, $d_\mathcal{S}$ is the dimension of the state space, and $d^\Phi_z$ is the zooming dimension of the policy set $\Phi$. $d^\Phi_z$ is an alternative measure of the complexity of the problem; it depends on the underlying MDP as well as on $\Phi$. Hence, the proposed algorithms exhibit low regret when the problem instance is benign and/or the agent competes against a low-complexity $\Phi$ (one with a small $d^\Phi_z$). When specialized to the case of a finite-dimensional policy space, we find that $d_{\text{eff.}}$ scales as the dimension of this space under mild technical conditions; we also obtain $d_{\text{eff.}} = 2$, or equivalently $\tilde{\mathcal{O}}(\sqrt{T})$ regret, for $\textit{PZRL-MF}$ under a curvature condition on the average-reward function that is commonly used in the multi-armed bandit (MAB) literature.
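The "zooming" idea in the abstract can be illustrated with a toy sketch: maintain an adaptive partition of the policy space, play the cell with the highest optimistic estimate of its average reward, and subdivide (zoom into) a cell once it has been sampled finely relative to its diameter, so that refinement concentrates in promising regions. The code below is a minimal illustration under assumed toy conditions only: a one-dimensional policy space $[0,1]$, a stand-in `rollout_avg_reward` environment, and ad-hoc UCB constants. It is not the paper's PZRL-MF or PZRL-MB pseudocode.

```python
import numpy as np

# Toy sketch of adaptive "policy zooming" (illustrative, not the paper's
# PZRL-MF/PZRL-MB): keep an adaptive partition of a 1-D policy space,
# play the cell with the highest optimistic score, and split a cell once
# its confidence width falls below its diameter.

rng = np.random.default_rng(0)

def rollout_avg_reward(policy_param):
    """Stand-in (assumed) environment: returns a noisy empirical average
    reward for running the policy with parameter `policy_param`."""
    return 1.0 - (policy_param - 0.3) ** 2 + 0.1 * rng.standard_normal()

# Each cell is (left, right, reward_sum, n_plays); start with one cell.
cells = [(0.0, 1.0, 0.0, 0)]

for t in range(1, 2001):
    def score(cell):
        lo, hi, s, n = cell
        # Unplayed cells get infinite score, forcing one visit each.
        mean = s / n if n else np.inf
        bonus = np.sqrt(2.0 * np.log(t) / n) if n else np.inf
        # Diameter term accounts for variation of reward within the cell
        # (Lipschitz-style optimism over the whole cell).
        return mean + bonus + (hi - lo)

    i = max(range(len(cells)), key=lambda j: score(cells[j]))
    lo, hi, s, n = cells[i]
    r = rollout_avg_reward((lo + hi) / 2.0)  # play the cell's center
    s, n = s + r, n + 1

    # Zoom rule: once the statistical uncertainty is smaller than the
    # cell's diameter, halve the cell; children start with fresh stats.
    if np.sqrt(2.0 * np.log(t) / n) < (hi - lo):
        mid = (lo + hi) / 2.0
        cells[i] = (lo, mid, 0.0, 0)
        cells.append((mid, hi, 0.0, 0))
    else:
        cells[i] = (lo, hi, s, n)

best = max((c for c in cells if c[3] > 0), key=lambda c: c[2] / c[3])
print(f"best cell ≈ [{best[0]:.3f}, {best[1]:.3f}]")
```

In this sketch the partition ends up much finer near the (toy) optimum at 0.3 than elsewhere, which is the adaptivity gain the abstract refers to: regret scales with the zooming dimension of the near-optimal region rather than with the full dimension of the policy space.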
