Minimax Regret Bounds for Reinforcement Learning
We consider the problem of efficient exploration in finite horizon MDPs. We show that an optimistic modification to model-based value iteration can achieve a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2 S^2 A + H\sqrt{T})$, where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the time elapsed. This result improves over the best previously known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm. The key significance of our new results is that when $T \geq H^3 S^3 A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use a careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we use "exploration bonuses" based on Bernstein's inequality, together with a recursive, Bellman-type law of total variance (to improve scaling in $H$).
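To make the algorithmic idea concrete, the following is a minimal NumPy sketch of optimistic model-based value iteration with a Bernstein-style exploration bonus, in the spirit of the approach the abstract describes. The function name, the exact bonus constants, and the form of the log-confidence term are illustrative assumptions, not the paper's precise specification.

```python
import numpy as np

def optimistic_value_iteration(counts, rewards, H, delta=0.05):
    """Backward induction with a Bernstein-style optimism bonus per (s, a).

    counts:  array of shape (S, A, S) with empirical transition counts.
    rewards: array of shape (S, A) with mean rewards in [0, 1].
    H:       episode horizon.
    Returns optimistic value functions V[h] and a greedy policy pi[h].
    Constants in the bonus are illustrative, not the paper's exact values.
    """
    S, A, _ = counts.shape
    n = np.maximum(counts.sum(axis=2), 1)         # visit counts N(s, a), floored at 1
    p_hat = counts / n[:, :, None]                # empirical transition probabilities
    L = np.log(5 * S * A * n / delta)             # assumed log-confidence term

    V = np.zeros((H + 1, S))                      # V[H] = 0 at the terminal step
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        ev = p_hat @ V[h + 1]                     # empirical mean of V_{h+1} per (s, a)
        var = np.maximum(p_hat @ V[h + 1] ** 2 - ev ** 2, 0.0)  # empirical variance
        # Bernstein-style bonus: variance-dependent term plus a range correction.
        bonus = np.sqrt(2 * var * L / n) + 7 * H * L / (3 * n)
        Q = np.minimum(rewards + ev + bonus, H)   # optimistic Q-values, clipped at H
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return V, pi

# Tiny usage example on a random 3-state, 2-action model.
rng = np.random.default_rng(0)
counts = rng.integers(1, 10, size=(3, 2, 3)).astype(float)
rewards = rng.random((3, 2))
V, pi = optimistic_value_iteration(counts, rewards, H=5)
print(V[0], pi[0])
```

The variance term in the bonus is what connects to the abstract's two insights: the bonus concentrates the value function as a whole rather than the transition probabilities, and its dependence on the empirical variance of $V_{h+1}$ is what a recursive law-of-total-variance argument can exploit to improve the scaling in $H$.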