Optimal Sample Complexity for Average Reward Markov Decision Processes
We settle the sample complexity of policy learning for the maximization of the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde{O}(|S||A|\, t_{\mathrm{mix}}^{2}\, \epsilon^{-2})$ and a lower bound of $\Omega(|S||A|\, t_{\mathrm{mix}}\, \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\mathrm{mix}}$ serves as a uniform upper bound on the total variation mixing times, and $\epsilon$ signifies the error tolerance. Therefore, a gap of a factor of $t_{\mathrm{mix}}$ remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde{O}(|S||A|\, t_{\mathrm{mix}}\, \epsilon^{-2})$, effectively reaching the lower bound in the literature. This is achieved by combining algorithmic ideas in Jin and Sidford (2021) with those of Li et al. (2020).
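To make the gap explicit, the bounds above can be summarized as follows, suppressing logarithmic factors via $\widetilde{O}$; here $n^{*}(\epsilon)$, notation introduced only for illustration, stands for the number of generative-model samples needed to learn an $\epsilon$-optimal policy:

\[
\underbrace{\Omega\!\left(|S||A|\, t_{\mathrm{mix}}\, \epsilon^{-2}\right)}_{\text{known lower bound}}
\;\leq\;
n^{*}(\epsilon)
\;\leq\;
\underbrace{\widetilde{O}\!\left(|S||A|\, t_{\mathrm{mix}}^{2}\, \epsilon^{-2}\right)}_{\text{prior upper bound}}.
\]

The two sides differ by a single factor of $t_{\mathrm{mix}}$, and the estimator developed in this paper attains $\widetilde{O}(|S||A|\, t_{\mathrm{mix}}\, \epsilon^{-2})$, matching the lower bound up to logarithmic factors.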