This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap in the optimal $Q$-function. We prove that the optimistic $Q$-learning studied in [Jin et al. 2018] enjoys a $\mathcal{O}\big(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log(SAT)\big)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap. This bound matches the information-theoretic lower bound in terms of $S$, $A$, and $T$ up to a $\log(SA)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
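The algorithm analyzed here is optimistic $Q$-learning with upper-confidence bonuses, in the spirit of Jin et al. 2018. The sketch below is an illustrative, untuned rendition (the bonus constant `c`, the toy MDP, and the fixed initial state are assumptions for demonstration, not the paper's setup): $Q$-values are initialized optimistically at $H$, actions are taken greedily, and each update mixes in the observed target plus a Hoeffding-style exploration bonus with the learning rate $\alpha_t = (H+1)/(H+t)$.

```python
import numpy as np

def optimistic_q_learning(P, R, H, K, c=2.0, delta=0.1, seed=0):
    """Sketch of optimistic Q-learning with Hoeffding-style UCB bonuses.

    P: transition kernel, shape (S, A, S); R: rewards in [0, 1], shape (S, A);
    H: planning horizon; K: number of episodes.
    Constants (c, delta) are illustrative, not the tuned values from the paper.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    T = K * H
    iota = np.log(S * A * T / delta)        # log factor inside the bonus
    Q = np.full((H, S, A), float(H))        # optimistic initialization at H
    N = np.zeros((H, S, A), dtype=int)      # per-step visit counts
    total_reward = 0.0
    for _ in range(K):
        s = 0                               # fixed initial state (assumption)
        for h in range(H):
            a = int(np.argmax(Q[h, s]))     # act greedily w.r.t. optimistic Q
            r = R[s, a]
            s_next = rng.choice(S, p=P[s, a])
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)       # learning rate from Jin et al. 2018
            bonus = c * np.sqrt(H**3 * iota / t)
            # Value of the next step, truncated at H (zero at the last step).
            V_next = 0.0 if h == H - 1 else min(H, Q[h + 1, s_next].max())
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V_next + bonus)
            total_reward += r
            s = s_next
    return Q, total_reward
```

A minimal usage example on a two-state, two-action toy MDP: `Q, ret = optimistic_q_learning(P, R, H=3, K=50)`. The truncation `min(H, ...)` keeps value targets bounded, which is what makes the optimistic estimates well behaved in the analysis.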