
Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

Abstract

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDPs) that enjoys a regret bound \emph{independent of the planning horizon}. Specifically, we consider a tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and an agent that plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret, in contrast to existing bounds that either carry an additional $\mathrm{polylog}(H)$ dependency~\citep{zhang2020reinforcement} or an exponential dependency on $S$~\citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration properties of stationary policies, which may have applications in other problems related to Markov chains.
