Markov Decision Processes with Long-Term Average Constraints

12 June 2021
Mridul Agarwal
Qinbo Bai
Vaneet Aggarwal
arXiv:2106.06680
Abstract

We consider the problem of constrained Markov Decision Processes (CMDPs), where an agent interacts with a unichain Markov Decision Process. At every interaction, the agent obtains a reward. Further, there are $K$ cost functions. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs below given thresholds. In this paper, we propose CMDP-PSRL, a posterior sampling based algorithm with which the agent can learn optimal policies to interact with the CMDP. Further, for an MDP with $S$ states, $A$ actions, and diameter $D$, we prove that following the CMDP-PSRL algorithm, the agent can bound the regret of not accumulating rewards from the optimal policy by $\tilde{O}(\mathrm{poly}(DSA)\sqrt{T})$. Further, we show that the violation of any of the $K$ constraints is also bounded by $\tilde{O}(\mathrm{poly}(DSA)\sqrt{T})$. To the best of our knowledge, this is the first work that obtains $\tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints.
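The abstract only sketches the method, so the following is a minimal, self-contained illustration of the general posterior-sampling recipe for constrained MDPs, not the paper's exact CMDP-PSRL procedure: maintain a Dirichlet posterior over the unknown transition kernel, sample a model each episode, solve the sampled model's constrained linear program over occupancy measures, and act with the resulting stationary policy. The problem sizes, cost thresholds, known-reward assumption, and fixed episode length below are all hypothetical choices made to keep the sketch short.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative problem sizes (not from the paper): S states, A actions, K constraints.
S, A, K = 5, 3, 1
rng = np.random.default_rng(0)

def solve_cmdp_lp(P, r, c, tau):
    """Solve an average-reward CMDP as a linear program over occupancy
    measures rho(s, a): maximize expected reward subject to flow
    conservation, normalization, nonnegativity, and the K average-cost
    constraints sum_{s,a} rho(s,a) c_k(s,a) <= tau_k."""
    n = S * A
    obj = -r.reshape(n)                       # linprog minimizes, so negate

    # Flow conservation: inflow to each state equals its outflow.
    A_eq = np.zeros((S + 1, n))
    for s in range(S):
        for a in range(A):
            idx = s * A + a
            A_eq[s, idx] -= 1.0               # outflow from (s, a)
            A_eq[:S, idx] += P[s, a]          # inflow via P(. | s, a)
    A_eq[S, :] = 1.0                          # rho is a probability measure
    b_eq = np.zeros(S + 1)
    b_eq[S] = 1.0

    res = linprog(obj, A_ub=c.reshape(K, n), b_ub=np.asarray(tau, float),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, None)] * n)
    assert res.status == 0, res.message       # sampled model must be feasible
    rho = res.x.reshape(S, A)
    # Recover a stationary policy pi(a|s) = rho(s,a) / sum_a rho(s,a);
    # fall back to uniform where the occupancy mass is (numerically) zero.
    mass = rho.sum(axis=1, keepdims=True)
    return np.where(mass > 1e-9, rho / np.maximum(mass, 1e-12), 1.0 / A)

# Posterior-sampling loop: Dirichlet posterior over transitions; rewards and
# costs are assumed known here purely to keep the sketch self-contained.
counts = np.ones((S, A, S))                   # Dirichlet(1) prior pseudo-counts
r = rng.uniform(size=(S, A))
c = rng.uniform(size=(K, S, A))
tau = [0.6] * K                               # hypothetical cost thresholds
true_P = rng.dirichlet(np.ones(S), size=(S, A))

state = 0
for episode in range(50):
    # Sample one plausible model from the posterior and plan against it.
    P_sample = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                         for s in range(S)])
    pi = solve_cmdp_lp(P_sample, r, c, tau)
    for t in range(100):                      # fixed-length episodes for simplicity
        action = rng.choice(A, p=pi[state])
        next_state = rng.choice(S, p=true_P[state, action])
        counts[state, action, next_state] += 1
        state = next_state
```

The LP-over-occupancy-measures step is the standard planning routine for average-reward CMDPs: the optimal stationary policy is read off from the solution as $\pi(a \mid s) = \rho(s,a) / \sum_{a'} \rho(s,a')$, and the $K$ cost constraints enter as ordinary linear inequalities on $\rho$.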
