Near-Optimal Sample Complexity Bounds for Constrained Average-Reward MDPs

Main: 10 pages
6 figures
Bibliography: 3 pages
Appendix: 17 pages
Abstract

Recent advances have significantly improved our understanding of the sample complexity of learning in average-reward Markov decision processes (AMDPs) under the generative model. However, much less is known about the constrained average-reward MDP (CAMDP), where policies must satisfy long-run average constraints. In this work, we address this gap by studying the sample complexity of learning an $\epsilon$-optimal policy in CAMDPs under a generative model. We propose a model-based algorithm that operates under two settings: (i) relaxed feasibility, which allows small constraint violations, and (ii) strict feasibility, where the output policy satisfies the constraint. We show that our algorithm achieves sample complexities of $\tilde{O}\left(\frac{SA(B+H)}{\epsilon^2}\right)$ and $\tilde{O}\left(\frac{SA(B+H)}{\epsilon^2 \zeta^2}\right)$ under the relaxed and strict feasibility settings, respectively. Here, $\zeta$ is the Slater constant indicating the size of the feasible region, $H$ is the span bound of the bias function, and $B$ is the transient time bound. Moreover, a matching lower bound of $\tilde{\Omega}\left(\frac{SA(B+H)}{\epsilon^2 \zeta^2}\right)$ for the strict feasibility case is established, thus providing the first minimax-optimal bounds for CAMDPs. Our results close the theoretical gap in understanding the complexity of constrained average-reward MDPs.
