Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism
- OffRL
This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm based on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint violation in every episode, our algorithm achieves a regret upper bound of $\widetilde{O}\big((\bar{C} - \bar{C}_b)^{-1} H^{2.5} S \sqrt{AK}\big)$, where $\bar{C}$ is the cost budget for an episode, $\bar{C}_b$ is the expected cost under a safe baseline policy over an episode, $H$ is the horizon, and $S$, $A$, and $K$ are the number of states, actions, and episodes, respectively. This improves upon the best-known regret upper bound, and when $\bar{C} - \bar{C}_b = \Omega(H)$, it nearly matches the regret lower bound of $\Omega(H^{1.5} \sqrt{SAK})$. We derive our cost and reward function estimators via a Bellman-type law of total variance, which yields tight bounds on the expected sum of the variances of value function estimates and, in turn, a tighter dependence on the horizon in the estimators. We also present numerical results to demonstrate the computational effectiveness of our proposed framework.
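For intuition, the following is the standard form of the Bellman-type law of total variance from the exploration literature, stated here as a generic sketch under simplifying assumptions (deterministic policy and deterministic rewards); the paper's exact statement may differ:

```latex
% Bellman-type law of total variance (generic sketch; assumes a
% deterministic policy \pi and deterministic rewards r_h \in [0, 1],
% with V_{H+1}^{\pi} \equiv 0). Define the one-step value variance
%   \mathbb{V}_h(s, a) := \mathrm{Var}_{s' \sim P_h(\cdot \mid s, a)}\big(V_{h+1}^{\pi}(s')\big).
% The expected sum of one-step variances along a trajectory equals the
% variance of the total return, which is at most H^2:
\[
  \mathbb{E}_{\pi}\!\left[ \sum_{h=1}^{H} \mathbb{V}_h(s_h, a_h) \right]
  = \mathrm{Var}_{\pi}\!\left( \sum_{h=1}^{H} r_h(s_h, a_h) \right)
  \le H^2 .
\]
```

Because the one-step variances sum to at most $H^2$ in expectation rather than the naive $H \cdot H^2$, Bernstein-style confidence widths built on these variances shave horizon factors off the regret. As a further illustration, here is a minimal, hypothetical sketch of Bernstein-style optimistic reward and pessimistic cost estimates; the function names and constants are assumptions for illustration, not the paper's estimators:

```python
import numpy as np

def bernstein_bonus(var_hat, n, horizon, delta=0.01):
    """Bernstein-style confidence width: scales with the empirical
    variance of the next-step value estimate instead of its worst-case
    range, which is what saves horizon factors in the analysis.
    Illustrative constants, not the paper's exact bonus."""
    log_term = np.log(1.0 / delta)
    n = np.maximum(n, 1)  # guard against unvisited (s, a) pairs
    return np.sqrt(2.0 * var_hat * log_term / n) + 7.0 * horizon * log_term / (3.0 * n)

def optimistic_reward(r_hat, var_hat, n, horizon):
    """Upper-confidence (optimistic) estimate of a [0, 1]-valued reward."""
    return np.minimum(r_hat + bernstein_bonus(var_hat, n, horizon), 1.0)

def pessimistic_cost(c_hat, var_hat, n, horizon):
    """Upper-confidence (pessimistic) estimate of a [0, 1]-valued cost;
    planning against inflated costs keeps the played policy within the
    cost budget with high probability."""
    return np.minimum(c_hat + bernstein_bonus(var_hat, n, horizon), 1.0)
```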
View on arXiv