Optimal Sample Complexity for Average Reward Markov Decision Processes

International Conference on Learning Representations (ICLR), 2023
Abstract

We settle the sample complexity of policy learning for the maximization of the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\text{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $\epsilon$ signifies the error tolerance. Therefore, a notable gap of $t_{\text{mix}}$ still remains to be bridged. Our primary contribution is to establish an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}\epsilon^{-2})$, effectively reaching the lower bound in the literature. This is achieved by combining algorithmic ideas in Jin and Sidford (2021) with those of Li et al. (2020).