Minimax Regret Bounds for Reinforcement Learning

Abstract

We consider the problem of efficient exploration in finite horizon MDPs. We show that an optimistic modification to model-based value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2S^2A + H\sqrt{T})$, where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions, and $T$ the time elapsed. This result improves over the best previously known bound $\tilde{O}(HS\sqrt{AT})$, achieved by the UCRL2 algorithm. The key significance of our new result is that when $T \geq H^3S^3A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We apply concentration inequalities carefully to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we use "exploration bonuses" based on Bernstein's inequality, together with a recursive, Bellman-type, Law of Total Variance (to improve scaling in $H$).
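The optimistic modification described in the abstract can be illustrated with a short sketch: backward-induction value iteration on an empirical model, where each state-action value receives a Bernstein-style exploration bonus built from the empirical variance of the next-step value function. This is a minimal illustration, not the paper's exact algorithm; the function name, the confidence term `L`, and the precise bonus constants are assumptions chosen for readability.

```python
import numpy as np

def optimistic_value_iteration(counts, transitions, rewards, H, delta=0.05):
    """Sketch of optimistic backward induction with a Bernstein-style bonus.

    counts[s, a]        -- visit counts N(s, a) for each state-action pair
    transitions[s, a, :] -- empirical transition probabilities P_hat(.|s, a)
    rewards[s, a]       -- empirical mean rewards, assumed in [0, 1]
    Returns optimistic Q-values Q[h, s, a] and a greedy policy pi[h, s].
    """
    S, A = counts.shape
    # Log-confidence term (illustrative choice, not the paper's exact constant).
    L = np.log(5 * S * A * max(counts.sum(), 1.0) / delta)
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):          # backward induction over the horizon
        for s in range(S):
            for a in range(A):
                n = max(counts[s, a], 1.0)
                p = transitions[s, a]
                ev = p @ V[h + 1]                    # expected next-step value
                var = p @ (V[h + 1] - ev) ** 2       # empirical variance of V
                # Bernstein-style bonus: variance-dependent term plus a
                # lower-order range term; this is what improves scaling in H
                # relative to a Hoeffding-style bonus of order H / sqrt(n).
                bonus = np.sqrt(2.0 * var * L / n) + 2.0 * H * L / (3.0 * n)
                # Values are clipped at H, the maximum attainable return.
                Q[h, s, a] = min(H, rewards[s, a] + ev + bonus)
            V[h, s] = Q[h, s].max()
            pi[h, s] = Q[h, s].argmax()
    return Q, pi
```

Because the bonus is non-negative, the resulting Q-values are optimistic with respect to the empirical model, which is the property driving the regret analysis.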
