arXiv:2006.05961
Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints

10 June 2020
Qinbo Bai
Vaneet Aggarwal
Ather Gattami
Abstract

In the optimization of dynamical systems, the decision variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward while also satisfying the average constraints in each episode. The key challenge with long-term constraints is that the optimal policy is not deterministic in general, so standard Q-learning approaches cannot be used directly. This paper combines concepts from constrained optimization and Q-learning to propose an algorithm for CMDPs with long-term constraints. For any $\gamma \in (0, \frac{1}{2})$, the proposed algorithm is shown to achieve an $O(T^{1/2+\gamma})$ regret bound on the obtained reward and an $O(T^{1-\gamma/2})$ regret bound on the constraint violation, where $T$ is the total number of steps. We note that these are the first regret-analysis results for MDPs with long-term constraints where the transition probabilities are not known a priori.
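To make the combination of constrained optimization and Q-learning mentioned in the abstract concrete, the sketch below shows one common way such ideas are realized: a primal-dual loop on a tabular CMDP, where Q-learning is run on the Lagrangian reward r − λ·c and the multiplier λ is updated by projected gradient ascent on the running constraint violation. This is a minimal illustration of the general Lagrangian approach, not the paper's algorithm, and it does not reproduce the stated regret guarantees; the environment interface, step sizes, and constraint form are all assumptions made here for the example.

    import numpy as np

    def primal_dual_q_learning(env, num_states, num_actions, T,
                               alpha=0.1, eta=0.01, lam_max=10.0,
                               constraint_threshold=0.0, epsilon=0.1,
                               gamma_discount=0.99, seed=0):
        """Sketch of a primal-dual Q-learning loop for a constrained MDP.

        Assumed (hypothetical) interface: env.reset() returns a state index,
        env.step(a) returns (next_state, reward, cost), and the constraint is
        "average cost <= constraint_threshold". This only illustrates the
        Lagrangian idea of trading reward against constraint violation.
        """
        rng = np.random.default_rng(seed)
        Q = np.zeros((num_states, num_actions))  # Q-values of the penalized reward
        lam = 0.0                                # dual variable (Lagrange multiplier)
        state = env.reset()
        total_cost = 0.0

        for t in range(1, T + 1):
            # epsilon-greedy action on the current Lagrangian Q-values
            if rng.random() < epsilon:
                action = int(rng.integers(num_actions))
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, cost = env.step(action)
            total_cost += cost

            # Primal step: Q-learning update on the penalized reward r - lam * c
            lagrangian_reward = reward - lam * cost
            td_target = lagrangian_reward + gamma_discount * np.max(Q[next_state])
            Q[state, action] += alpha * (td_target - Q[state, action])

            # Dual step: raise lam when the running average cost exceeds the
            # threshold, lower it otherwise; project onto [0, lam_max]
            avg_violation = total_cost / t - constraint_threshold
            lam = float(np.clip(lam + eta * avg_violation, 0.0, lam_max))

            state = next_state

        return Q, lam

The dual step is what handles the abstract's observation that the optimal policy may be stochastic: instead of enforcing the constraint through a fixed deterministic rule, the multiplier λ continually re-weights the cost against the reward as the observed average violation changes.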
