Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound \emph{independent of the planning horizon}. Specifically, we consider a tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret, in contrast to existing bounds which either have an additional $\mathrm{polylog}(H)$ dependency~\citep{zhang2020reinforcement} or have an exponential dependency on $S$~\citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.
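To make the objects in the abstract concrete, the following is a minimal, self-contained sketch (not the paper's algorithm) of a tabular MDP with total reward bounded by $1$, contrasting a stationary policy (one decision rule $\pi(s)$ reused at every step) with a non-stationary one ($\pi_h(s)$, one rule per step $h \le H$). All variable names ($S$, $A$, $H$, $P$, $r$) are generic placeholders introduced for illustration.

```python
# Illustrative sketch only: a small tabular MDP and policy evaluation,
# contrasting stationary and non-stationary policies. Not the paper's method.
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 3, 50                            # small tabular MDP, long horizon

P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] is a distribution over next states
r = rng.random((S, A)) / H                    # per-step rewards scaled so total reward <= 1

def episode_value(policy, s0=0):
    """Expected total reward over H steps; `policy(h, s)` returns an action."""
    dist = np.zeros(S); dist[s0] = 1.0        # current distribution over states
    total = 0.0
    for h in range(H):
        for s in range(S):
            total += dist[s] * r[s, policy(h, s)]
        new_dist = np.zeros(S)
        for s in range(S):
            new_dist += dist[s] * P[s, policy(h, s)]
        dist = new_dist
    return total

stationary = rng.integers(A, size=S)           # one action per state, reused at every step
non_stationary = rng.integers(A, size=(H, S))  # one action per (step, state) pair

print("stationary policy value:    ", episode_value(lambda h, s: stationary[s]))
print("non-stationary policy value:", episode_value(lambda h, s: non_stationary[h, s]))
```

The paper's structural lemmas concern how well such stationary policies can approximate the best non-stationary ones in this bounded-total-reward setting; the sketch above only fixes the definitions.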