We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to efficiently learn Reinforcement Learning (RL) problems modeled by a Markov decision process (MDP) with finite state-action space. By evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored.} for an MDP with $S$ states and $A$ actions, in the case that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$, is known. This result outperforms the best previous regret bound $\tilde{O}(S\sqrt{AHT})$ \citep{fruit2019improved} by a factor of $\sqrt{S}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SAHT})$ \citep{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near-optimal regret bound of $\tilde{O}(\sqrt{DSAT})$ for MDPs with finite diameter $D$, compared to the lower bound of $\Omega(\sqrt{DSAT})$ \citep{jaksch2010near}.
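For reference, a sketch of the standard average-reward MDP definitions behind these bounds (the abstract itself does not restate them, so the notation below is assumed rather than quoted): the optimal gain $\rho^{*}$ and bias $h^{*}$ satisfy the average-reward Bellman optimality equation, the span of the bias is what the known upper bound $H$ controls, and the regret compares collected reward against the optimal gain:
\[
\rho^{*} + h^{*}(s) = \max_{a}\Big[r(s,a) + \sum_{s'} P(s'\mid s,a)\,h^{*}(s')\Big],
\qquad
sp(h^{*}) = \max_{s} h^{*}(s) - \min_{s} h^{*}(s) \le H,
\]
\[
\mathrm{Regret}(T) = T\rho^{*} - \sum_{t=1}^{T} r_{t}.
\]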