Nearly Horizon-Free Offline Reinforcement Learning

Neural Information Processing Systems (NeurIPS), 2021
Abstract

We revisit offline reinforcement learning on episodic time-homogeneous tabular Markov Decision Processes with $S$ states, $A$ actions, and planning horizon $H$. Given $N$ collected episodes with minimum cumulative reaching probability $d_m$, we obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs: 1. For offline evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{N d_m}}\right)$ error rate, which matches the lower bound and has no additional dependency on $\mathrm{poly}(S, A)$ in the higher-order term, in contrast to previous works~\citep{yin2020near,yin2020asymptotically}. 2. For offline policy optimization, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{N d_m}} + \frac{S}{N d_m}\right)$ error rate, improving upon the best known result of \cite{cui2020plug}, which has additional $H$ and $S$ factors in its main term. Furthermore, this bound approaches the $\Omega\left(\sqrt{\frac{1}{N d_m}}\right)$ lower bound up to logarithmic factors and a higher-order term. To the best of our knowledge, these are the first set of nearly horizon-free bounds in offline reinforcement learning.
