Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to an additive factor of $\epsilon_{\mathrm{bias}}$) with a sample complexity of $\tilde{\mathcal{O}}\big(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-1/3}\}\big)$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{1/6}$. Moreover, for complete policy classes ($\epsilon_{\mathrm{bias}}=0$), our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. This significantly improves upon the state-of-the-art last-iterate guarantees for general parameterized CMDPs.
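The abstract does not spell out the update rules, so the following is only a minimal sketch of the primal-dual, regularized natural-policy-gradient idea it describes: a toy tabular CMDP with a softmax policy, an entropy bonus added to the Lagrangian payoff (the primal regularizer), a quadratic regularizer on the Lagrange multiplier (the dual regularizer), and exact policy evaluation in place of PDR-ANPG's sample-based estimators and acceleration. The toy MDP, step sizes, and the weights `tau` and `nu` are hypothetical choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, :] = next-state distribution
r = rng.uniform(0.0, 1.0, size=(S, A))       # reward table
c = rng.uniform(0.0, 1.0, size=(S, A))       # constraint-cost table
b = 3.0                                      # constraint budget: J_c(pi) <= b
rho = np.ones(S) / S                         # initial-state distribution

tau, nu = 0.01, 0.01         # entropy (primal) and quadratic (dual) regularization weights
eta_pi, eta_lam = 1.0, 0.05  # primal / dual step sizes
theta = np.zeros((S, A))     # softmax policy parameters
lam = 0.0                    # Lagrange multiplier


def softmax(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def evaluate(pi, payoff):
    """Exact policy evaluation for a payoff table: returns V, Q, and rho^T V."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    g_pi = (pi * payoff).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, g_pi)
    Q = payoff + gamma * P @ V
    return V, Q, rho @ V


for t in range(3000):
    pi = softmax(theta)
    # Lagrangian payoff with an entropy bonus acting as the primal regularizer
    payoff = r - lam * c - tau * np.log(pi + 1e-12)
    V, Q, _ = evaluate(pi, payoff)
    adv = Q - V[:, None]
    # NPG step: for tabular softmax, the natural-gradient direction is the advantage
    theta += (eta_pi / (1 - gamma)) * adv
    # Projected dual ascent on the constraint, damped by the quadratic regularizer
    _, _, Jc = evaluate(pi, c)
    lam = max(0.0, lam + eta_lam * (Jc - b - nu * lam))

pi = softmax(theta)
print("reward value:", evaluate(pi, r)[2], "cost value:", evaluate(pi, c)[2], "lambda:", lam)
```

Under the entropy and quadratic regularization, the inner Lagrangian becomes strongly concave-convex in the primal occupancy and dual variable, which is what makes last-iterate (rather than averaged-iterate) guarantees of the kind stated above plausible; the sketch keeps that structure but none of the paper's finite-sample machinery.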