Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization

Anirudh Satheesh
Vaneet Aggarwal
Main: 14 pages · Bibliography: 5 pages · Appendix: 20 pages · 1 table
Abstract

We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. Existing regret analyses for constrained reinforcement learning largely rely on ergodicity or strong mixing-time assumptions, which fail to hold in the presence of transient states. We propose a primal-dual natural actor-critic algorithm that leverages multi-level Monte Carlo (MLMC) estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles. Our analysis establishes finite-time regret and cumulative constraint violation bounds that scale as $\tilde{O}(\sqrt{T})$, up to approximation errors arising from policy and critic parameterization, thereby extending order-optimal guarantees to a significantly broader class of CMDPs.
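For intuition on the estimator the abstract mentions, below is a minimal sketch of the geometric-level MLMC construction commonly used for steady-state reward and gradient estimation when the mixing time is unknown. This is an illustration under stated assumptions, not the paper's implementation: the names `mlmc_estimate` and `sample_fn`, the level distribution $P(J = j) = 2^{-j}$, and the toy unichain are all illustrative choices.

```python
import numpy as np

def mlmc_estimate(sample_fn, max_level=10, rng=None):
    """Sketch of a multi-level Monte Carlo (MLMC) steady-state mean estimator.

    sample_fn(n) is assumed to return n consecutive samples (rewards or
    per-step gradient terms) from one continuing rollout of the chain.
    In expectation the estimate matches the average of 2**max_level
    consecutive samples, while the expected sample cost per call is
    only O(max_level), with no mixing-time knowledge required.
    """
    rng = np.random.default_rng() if rng is None else rng
    j = rng.geometric(0.5)                 # level J with P(J = j) = 2**-j
    n = 2 ** j if j <= max_level else 1    # truncate very deep levels
    xs = np.asarray(sample_fn(n), dtype=float)
    estimate = xs[0]                       # single-sample baseline term
    if j <= max_level:
        fine = xs.mean()                   # mean over 2**J samples
        coarse = xs[: n // 2].mean()       # mean over first 2**(J-1) samples
        # Telescoping correction, reweighted by 1 / P(J = j) = 2**j.
        estimate += (2 ** j) * (fine - coarse)
    return estimate

# Toy demo on a small unichain: state 0 is transient, states 1-2 recurrent.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.4, 0.6]])
r = np.array([0.0, 1.0, 2.0])
_state = [0]
_rng = np.random.default_rng(0)

def rollout(n):
    """Continue the single trajectory for n more steps, returning rewards."""
    out = np.empty(n)
    for t in range(n):
        _state[0] = _rng.choice(3, p=P[_state[0]])
        out[t] = r[_state[0]]
    return out

print(mlmc_estimate(rollout))  # one noisy estimate of the stationary mean reward
```

Because the levels telescope over doubling averaging windows, the truncation bias decays with `max_level` even when the chain mixes slowly, which is, per the abstract, the property that lets the analysis dispense with mixing-time oracles.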
