Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning
We study infinite-horizon average-reward reinforcement learning (RL) for continuous-space Lipschitz MDPs in which an agent can play policies from a given set $\Phi$. The proposed algorithms efficiently explore the policy space by "zooming" into the "promising regions" of $\Phi$, thereby achieving adaptivity gains in performance. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where the effective dimension $d_{\text{eff.}}$ takes different values for the model-free and the model-based algorithm and is expressed in terms of $d_{\mathcal{S}}$, the dimension of the state space, and $d_z^{\Phi}$, the zooming dimension given the set of policies $\Phi$. The zooming dimension $d_z^{\Phi}$ is an alternative measure of the complexity of the problem: it depends on the underlying MDP as well as on $\Phi$. Hence, the proposed algorithms exhibit low regret when the problem instance is benign and/or the agent competes against a low-complexity policy set $\Phi$ (one with a small $d_z^{\Phi}$). When specialized to the case of a finite-dimensional policy space, we show that $d_z^{\Phi}$ scales as the dimension of this space under mild technical conditions; we also obtain $d_z^{\Phi} = 0$, or equivalently $\tilde{\mathcal{O}}(\sqrt{T})$ regret, under a curvature condition on the average-reward function that is commonly used in the multi-armed bandit (MAB) literature.
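To make the zooming idea concrete, here is a minimal Python sketch, not the paper's algorithm: a hypothetical UCB-style adaptive-discretization loop over a one-dimensional policy-parameter space $\Phi = [0, 1]$. The reward oracle `avg_reward`, the `Cell` class, the confidence bonus, and the splitting rule are all illustrative assumptions chosen for brevity; the paper's algorithms operate on Lipschitz MDPs rather than the toy noisy oracle used here.

```python
import math
import random

LIPSCHITZ_CONST = 1.0   # assumed Lipschitz constant of the average-reward function
HORIZON = 5_000

def avg_reward(theta: float) -> float:
    """Stand-in for a noisy rollout estimate of the average reward of policy theta."""
    return 1.0 - (theta - 0.3) ** 2 + random.gauss(0.0, 0.1)

class Cell:
    """A subinterval of the policy-parameter space with sufficient statistics."""
    def __init__(self, lo: float, hi: float):
        self.lo, self.hi = lo, hi
        self.n = 0          # number of rollouts charged to this cell
        self.mean = 0.0     # running mean of observed rewards

    @property
    def diameter(self) -> float:
        return self.hi - self.lo

    def ucb(self, t: int) -> float:
        # Optimism: empirical mean + confidence width + discretization bias.
        if self.n == 0:
            return float("inf")
        bonus = math.sqrt(2.0 * math.log(t + 1) / self.n)
        return self.mean + bonus + LIPSCHITZ_CONST * self.diameter

cells = [Cell(0.0, 1.0)]
for t in range(1, HORIZON + 1):
    cell = max(cells, key=lambda c: c.ucb(t))    # most promising region of Phi
    theta = random.uniform(cell.lo, cell.hi)     # play a policy from that region
    r = avg_reward(theta)
    cell.n += 1
    cell.mean += (r - cell.mean) / cell.n
    # "Zoom in": once the statistical error is small relative to the cell's
    # diameter, split the cell so promising regions get finer resolution.
    if cell.n >= 4.0 * math.log(t + 1) / cell.diameter ** 2:
        mid = (cell.lo + cell.hi) / 2.0
        cells.remove(cell)
        cells += [Cell(cell.lo, mid), Cell(mid, cell.hi)]

best = max(cells, key=lambda c: c.mean if c.n else -float("inf"))
print(f"most promising region: [{best.lo:.3f}, {best.hi:.3f}], mean reward {best.mean:.3f}")
```

The point of the sketch is the adaptivity mechanism: refinement is triggered only where sampling concentrates, so the discretization becomes fine only near the better-performing policies. The zooming dimension quantifies how much of the policy space ever needs such fine resolution.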