
Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning

Comments: 7 pages (main text), 3 figures, 2 pages (bibliography), 29 pages (appendix)
Abstract

We study infinite-horizon average-reward reinforcement learning (RL) for continuous-space Lipschitz MDPs in which an agent can play policies from a given set $\Phi$. The proposed algorithms efficiently explore the policy space by "zooming" into the "promising regions" of $\Phi$, thereby achieving adaptivity gains in performance. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = d^\Phi_z + 2$ for the model-free algorithm $\textit{PZRL-MF}$ and $d_{\text{eff.}} = 2d_\mathcal{S} + d^\Phi_z + 3$ for the model-based algorithm $\textit{PZRL-MB}$. Here, $d_\mathcal{S}$ is the dimension of the state space, and $d^\Phi_z$ is the zooming dimension of the policy set $\Phi$. $d^\Phi_z$ is an alternative measure of the complexity of the problem; it depends on the underlying MDP as well as on $\Phi$. Hence, the proposed algorithms exhibit low regret when the problem instance is benign and/or the agent competes against a low-complexity $\Phi$ (one with a small $d^\Phi_z$). When specialized to the case of a finite-dimensional policy space, we find that $d_{\text{eff.}}$ scales as the dimension of this space under mild technical conditions; we also obtain $d_{\text{eff.}} = 2$, or equivalently $\tilde{\mathcal{O}}(\sqrt{T})$ regret, for $\textit{PZRL-MF}$ under a curvature condition on the average-reward function that is commonly used in the multi-armed bandit (MAB) literature.
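The "zooming" idea in the abstract can be illustrated with a toy sketch: maintain an adaptive partition of the policy space, play the cell with the highest optimistic estimate of its average reward, and subdivide (zoom into) a cell once it has been sampled finely relative to its diameter, so that refinement concentrates in promising regions. The code below is a minimal illustration under assumed toy conditions only: a one-dimensional policy space $[0,1]$, a stand-in `rollout_avg_reward` environment, and ad-hoc UCB constants. It is not the paper's PZRL-MF or PZRL-MB pseudocode.

```python
import numpy as np

# Toy sketch of adaptive "policy zooming" (illustrative, not the paper's
# PZRL-MF/PZRL-MB): keep an adaptive partition of a 1-D policy space,
# play the cell with the highest optimistic score, and split a cell once
# its confidence width falls below its diameter.

rng = np.random.default_rng(0)

def rollout_avg_reward(policy_param):
    """Stand-in (assumed) environment: returns a noisy empirical average
    reward for running the policy with parameter `policy_param`."""
    return 1.0 - (policy_param - 0.3) ** 2 + 0.1 * rng.standard_normal()

# Each cell is (left, right, reward_sum, n_plays); start with one cell.
cells = [(0.0, 1.0, 0.0, 0)]

for t in range(1, 2001):
    def score(cell):
        lo, hi, s, n = cell
        # Unplayed cells get infinite score, forcing one visit each.
        mean = s / n if n else np.inf
        bonus = np.sqrt(2.0 * np.log(t) / n) if n else np.inf
        # Diameter term accounts for variation of reward within the cell
        # (Lipschitz-style optimism over the whole cell).
        return mean + bonus + (hi - lo)

    i = max(range(len(cells)), key=lambda j: score(cells[j]))
    lo, hi, s, n = cells[i]
    r = rollout_avg_reward((lo + hi) / 2.0)  # play the cell's center
    s, n = s + r, n + 1

    # Zoom rule: once the statistical uncertainty is smaller than the
    # cell's diameter, halve the cell; children start with fresh stats.
    if np.sqrt(2.0 * np.log(t) / n) < (hi - lo):
        mid = (lo + hi) / 2.0
        cells[i] = (lo, mid, 0.0, 0)
        cells.append((mid, hi, 0.0, 0))
    else:
        cells[i] = (lo, hi, s, n)

best = max((c for c in cells if c[3] > 0), key=lambda c: c[2] / c[3])
print(f"best cell ≈ [{best[0]:.3f}, {best[1]:.3f}]")
```

In this sketch the partition ends up much finer near the (toy) optimum at 0.3 than elsewhere, which is the adaptivity gain the abstract refers to: regret scales with the zooming dimension of the near-optimal region rather than with the full dimension of the policy space.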
