Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization

Anirudh Satheesh
Vaneet Aggarwal
Main: 14 pages · Bibliography: 5 pages · Appendix: 20 pages · 1 table
Abstract

We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. Existing regret analyses for constrained reinforcement learning largely rely on ergodicity or strong mixing-time assumptions, which fail to hold in the presence of transient states. We propose a primal-dual natural actor-critic algorithm that leverages multi-level Monte Carlo (MLMC) estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles. Our analysis establishes finite-time regret and cumulative constraint violation bounds that scale as $\tilde{O}(\sqrt{T})$, up to approximation errors arising from policy and critic parameterization, thereby extending order-optimal guarantees to a significantly broader class of CMDPs.
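For intuition on the estimator the abstract mentions, below is a minimal sketch of the geometric-level MLMC construction commonly used for steady-state reward and gradient estimation when the mixing time is unknown. This is an illustration under stated assumptions, not the paper's implementation: the names `mlmc_estimate` and `sample_fn`, the level distribution $P(J = j) = 2^{-j}$, and the toy unichain are all illustrative choices.

```python
import numpy as np

def mlmc_estimate(sample_fn, max_level=10, rng=None):
    """Sketch of a multi-level Monte Carlo (MLMC) steady-state mean estimator.

    sample_fn(n) is assumed to return n consecutive samples (rewards or
    per-step gradient terms) from one continuing rollout of the chain.
    In expectation the estimate matches the average of 2**max_level
    consecutive samples, while the expected sample cost per call is
    only O(max_level), with no mixing-time knowledge required.
    """
    rng = np.random.default_rng() if rng is None else rng
    j = rng.geometric(0.5)                 # level J with P(J = j) = 2**-j
    n = 2 ** j if j <= max_level else 1    # truncate very deep levels
    xs = np.asarray(sample_fn(n), dtype=float)
    estimate = xs[0]                       # single-sample baseline term
    if j <= max_level:
        fine = xs.mean()                   # mean over 2**J samples
        coarse = xs[: n // 2].mean()       # mean over first 2**(J-1) samples
        # Telescoping correction, reweighted by 1 / P(J = j) = 2**j.
        estimate += (2 ** j) * (fine - coarse)
    return estimate

# Toy demo on a small unichain: state 0 is transient, states 1-2 recurrent.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.4, 0.6]])
r = np.array([0.0, 1.0, 2.0])
_state = [0]
_rng = np.random.default_rng(0)

def rollout(n):
    """Continue the single trajectory for n more steps, returning rewards."""
    out = np.empty(n)
    for t in range(n):
        _state[0] = _rng.choice(3, p=P[_state[0]])
        out[t] = r[_state[0]]
    return out

print(mlmc_estimate(rollout))  # one noisy estimate of the stationary mean reward
```

Because the levels telescope over doubling averaging windows, the truncation bias decays with `max_level` even when the chain mixes slowly, which is, per the abstract, the property that lets the analysis dispense with mixing-time oracles.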
