
Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Abstract

We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to an additive factor of $\epsilon_{\mathrm{bias}}$) with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. This is a significant improvement over the state-of-the-art last-iterate guarantees for general parameterized CMDPs.
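
For intuition, one common way such entropy- and quadratically-regularized primal-dual methods are set up is sketched below under assumed notation; the symbols $J_r$, $J_c$, $b$, $\delta$, $\beta$, and $\lambda$ are illustrative and need not match the paper's exact formulation:

$$
\max_{\theta}\ \min_{\lambda \ge 0}\ L_{\delta,\beta}(\theta,\lambda) \;=\; J_r(\theta) \;+\; \delta\,\mathcal{H}(\pi_\theta) \;+\; \lambda\big(J_c(\theta)-b\big) \;+\; \frac{\beta}{2}\,\lambda^{2},
$$

where $J_r(\theta)$ and $J_c(\theta)$ are the expected discounted reward and constraint values of the parameterized policy $\pi_\theta$, $b$ is the constraint threshold, $\mathcal{H}(\pi_\theta)$ is a (discounted) policy entropy term, and $\delta,\beta>0$ are the entropy and quadratic regularization weights. In this kind of scheme, the primal variable $\theta$ is updated with (accelerated) natural policy gradient steps on $L_{\delta,\beta}$ while the dual variable $\lambda$ takes projected gradient steps; the regularizers make the saddle-point problem better conditioned, at the cost of a bias that is controlled by choosing $\delta$ and $\beta$ appropriately.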
