arXiv: 2312.14470

Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration

22 December 2023
Honghao Wei
Xin Liu
Lei Ying
Abstract

This paper studies safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints where unsafe actions must be avoided at each step. Existing studies have considered safe RL with hard instantaneous constraints, but their approaches rely on several key assumptions: (i) the RL agent knows a safe action set for every state or knows a safe graph in which all the state-action-state triples are safe, and (ii) the constraint/cost functions are linear. In this paper, we consider safe RL with instantaneous hard constraints without assumption (i) and generalize (ii) to Reproducing Kernel Hilbert Space (RKHS). Our proposed algorithm, LSVI-AE, achieves $\tilde{\mathcal{O}}(\sqrt{d^3 H^4 K})$ regret and $\tilde{\mathcal{O}}(H\sqrt{dK})$ hard constraint violation when the cost function is linear and $\mathcal{O}(H\gamma_K\sqrt{K})$ hard constraint violation when the cost function belongs to RKHS. Here $K$ is the learning horizon, $H$ is the length of each episode, and $\gamma_K$ is the information gain w.r.t. the kernel used to approximate cost functions. Our results achieve the optimal dependency on the learning horizon $K$, matching the lower bound we provide in this paper and demonstrating the efficiency of LSVI-AE. Notably, the design of our approach encourages aggressive policy exploration, providing a unique perspective on safe RL with general cost functions and no prior knowledge of safe actions, which may be of independent interest.
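To make the stated rates concrete, the following is a minimal sketch that evaluates only the leading-order terms of the three bounds in the abstract. It is not the LSVI-AE algorithm and not the authors' code: constants and logarithmic factors hidden by the O-notation are dropped, the function names are hypothetical, and the values of d, H, and K in the demo are arbitrary illustrations.

```python
import math


def regret_bound(d: int, H: int, K: int) -> float:
    """Leading term sqrt(d^3 * H^4 * K) of the regret bound (constants and logs dropped)."""
    return math.sqrt(d**3 * H**4 * K)


def violation_bound_linear(d: int, H: int, K: int) -> float:
    """Leading term H * sqrt(d * K) of the violation bound for linear cost functions."""
    return H * math.sqrt(d * K)


def violation_bound_rkhs(H: int, gamma_K: float, K: int) -> float:
    """Leading term H * gamma_K * sqrt(K) of the violation bound for RKHS cost functions.

    gamma_K (the information gain) depends on the kernel and on K; here it is
    simply passed in as a number, which is an assumption of this sketch.
    """
    return H * gamma_K * math.sqrt(K)


if __name__ == "__main__":
    d, H = 4, 10  # arbitrary illustrative values, not taken from the paper
    for K in (100, 400, 1600):
        total = regret_bound(d, H, K)
        # Dividing by K shows the per-episode average shrinking like 1/sqrt(K).
        print(K, total, total / K, violation_bound_linear(d, H, K))
```

Because every term scales as the square root of K, quadrupling the number of episodes doubles the cumulative bound while halving the per-episode average, which is the sense in which the dependence on K is sublinear.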
