
Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs

Abstract

This paper considers the best policy identification (BPI) problem in online Constrained Markov Decision Processes (CMDPs). We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with high probability. Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation provide no convergence guarantee to an optimal policy; they offer only average performance guarantees, which hold when a policy is sampled uniformly at random from all previously used policies. In this paper, we develop a new algorithm, named Pruning-Refinement-Identification (PRI), based on a fundamental structural property of CMDPs established in prior work, which we call limited stochasticity: for a CMDP with $N$ constraints, there exists an optimal policy with at most $N$ stochastic decisions. The proposed algorithm first identifies at which steps and in which states a stochastic decision has to be taken, and then fine-tunes the distributions of these stochastic decisions. PRI achieves three objectives: (i) it is a model-free algorithm; (ii) it outputs an approximately optimal policy with high probability at the end of learning; and (iii) it guarantees $\tilde{\mathcal{O}}(H\sqrt{K})$ regret and constraint violation, which significantly improves the best existing regret bound $\tilde{\mathcal{O}}(H^4\sqrt{SA}\,K^{\frac{4}{5}})$ achieved by a model-free algorithm, where $H$ is the length of each episode, $S$ is the number of states, $A$ is the number of actions, and the total number of episodes during learning is $2K+\tilde{\mathcal{O}}(K^{0.25})$. We further present a matching lower bound via an example showing that, under any online learning algorithm, there exists a well-separated CMDP instance for which either the regret or the constraint violation must be $\Omega(H\sqrt{K})$, matching the upper bound up to a polylogarithmic factor.
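The limited-stochasticity property can be illustrated with a minimal sketch (the toy sizes, the base policy, and the single randomized pair below are assumptions for illustration, not taken from the paper): a policy for a CMDP with $N$ constraints can act deterministically at every (step, state) pair except at most $N$ of them, where it mixes over actions.

```python
import random

# Toy sizes (assumed for illustration): horizon H, S states, A actions,
# and N = 1 constraint.
H, S, A = 5, 4, 3
N = 1

# Deterministic base policy: one fixed action per (step, state) pair.
base = {(h, s): (h + s) % A for h in range(H) for s in range(S)}

# At most N stochastic decisions: (step, state) -> distribution over actions.
# Here exactly one pair is randomized, consistent with N = 1.
stochastic = {(2, 1): {0: 0.3, 1: 0.7}}
assert len(stochastic) <= N

def act(h, s, rng=random):
    """Sample an action: randomize only at the designated pairs."""
    if (h, s) in stochastic:
        dist = stochastic[(h, s)]
        return rng.choices(list(dist), weights=list(dist.values()))[0]
    return base[(h, s)]
```

In this representation, learning the policy splits naturally into the two phases the abstract describes: first locate the few (step, state) pairs that must be randomized, then tune only those distributions.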
