Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

Abstract

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty} principle and a Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor, and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
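To illustrate the flavor of an optimistic, model-based update with a variance-dependent bonus in a tabular discounted MDP, here is a minimal sketch. It is not the paper's UCBVI-$\gamma$ algorithm: the function name `optimistic_value_iteration`, the assumption of known mean rewards, and the bonus constants and confidence parameter `delta` are all illustrative placeholders, and the Bernstein-style bonus form is only an assumed stand-in for the paper's construction.

```python
import numpy as np

def optimistic_value_iteration(counts, reward, gamma, delta=0.01, iters=500):
    """Sketch of optimistic value iteration with a Bernstein-style bonus.

    counts[s, a, s'] : number of observed transitions (s, a) -> s'
    reward[s, a]     : mean reward in [0, 1], assumed known here for simplicity
    gamma            : discount factor in (0, 1)
    """
    S, A, _ = counts.shape
    n = np.maximum(counts.sum(axis=2), 1)        # visit counts N(s, a), floored at 1
    p_hat = counts / n[:, :, None]               # empirical transition model
    v_max = 1.0 / (1.0 - gamma)                  # range of values for rewards in [0, 1]
    log_term = np.log(S * A * np.sum(n) / delta) # illustrative confidence term

    q = np.full((S, A), v_max)                   # optimistic initialization
    for _ in range(iters):
        v = q.max(axis=1)
        ev = p_hat @ v                           # empirical expectation of next-state value
        var = p_hat @ (v ** 2) - ev ** 2         # empirical variance of next-state value
        # Bernstein-style bonus: variance-dependent term plus a lower-order term.
        bonus = np.sqrt(2 * var * log_term / n) + 3 * v_max * log_term / n
        q_new = np.minimum(reward + gamma * (ev + bonus), v_max)
        if np.max(np.abs(q_new - q)) < 1e-6:
            q = q_new
            break
        q = q_new
    return q
```

The sketch captures the two ingredients named in the abstract: optimism (the Q-values are initialized and kept at upper confidence values via the bonus) and a variance-dependent, Bernstein-type exploration term that shrinks as visit counts grow.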
