Minimax Optimal Reinforcement Learning for Discounted MDPs

Neural Information Processing Systems (NeurIPS), 2020
Abstract

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) in the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the optimism in the face of uncertainty principle and a Bernstein-type bonus. It achieves an $\tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor, and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly optimal for discounted MDPs.
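To make the optimism principle and the Bernstein-type bonus concrete, the following is a minimal sketch of optimistic value iteration for a tabular discounted MDP. It is an illustrative assumption, not the paper's exact UCBVI-$\gamma$ procedure: the function name, constants, and the precise bonus form are placeholders.

```python
# Illustrative sketch only: optimistic value iteration with a Bernstein-type
# bonus for a tabular discounted MDP. Constants and the bonus form are
# simplified placeholders, not the exact quantities used in UCBVI-gamma.
import numpy as np

def optimistic_value_iteration(counts, rewards, gamma, delta, num_iters=1000):
    """counts:  (S, A, S) visit counts of observed transitions.
       rewards: (S, A) rewards in [0, 1].
       Returns an optimistic Q-function of shape (S, A)."""
    S, A, _ = counts.shape
    n = np.maximum(counts.sum(axis=2), 1)           # visits per (s, a)
    p_hat = counts / n[:, :, None]                  # empirical transition model
    v_max = 1.0 / (1.0 - gamma)                     # upper bound on any value
    log_term = np.log(S * A * n / delta)            # confidence-level term

    q = np.full((S, A), v_max)                      # optimistic initialization
    for _ in range(num_iters):
        v = np.minimum(q.max(axis=1), v_max)        # clipped optimistic value
        ev = p_hat @ v                              # expected next-state value
        var = p_hat @ (v ** 2) - ev ** 2            # empirical variance of V
        # Bernstein-type bonus: variance-dependent term plus lower-order term.
        bonus = np.sqrt(2.0 * var * log_term / n) + v_max * log_term / (3.0 * n)
        q_new = np.minimum(rewards + bonus + gamma * ev, v_max)
        if np.max(np.abs(q_new - q)) < 1e-6:
            q = q_new
            break
        q = q_new
    return q
```

The variance-dependent bonus shrinks faster than a Hoeffding-style bonus when the next-state value has low variance, which is the mechanism behind improving the horizon dependence from $(1-\gamma)^{-2}$ to $(1-\gamma)^{-1.5}$ in the regret bound.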
