
Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

Main: 18 pages
2 figures
2 tables
Bibliography: 1 page
Appendix: 41 pages
Abstract

This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm based on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint violation in every episode, our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar C - \bar C_b)^{-1} H^{2.5} S \sqrt{AK})$, where $\bar C$ is the cost budget for an episode, $\bar C_b$ is the expected cost under a safe baseline policy over an episode, $H$ is the horizon, and $S$, $A$, and $K$ are the numbers of states, actions, and episodes, respectively. This improves upon the best-known regret upper bound, and when $\bar C - \bar C_b = \Omega(H)$, it nearly matches the regret lower bound of $\Omega(H^{1.5}\sqrt{SAK})$. We derive our cost and reward function estimators via a Bellman-type law of total variance, which yields tight bounds on the expected sum of the variances of value function estimates and, in turn, a tighter dependence on the horizon in the estimators. We also present numerical results to demonstrate the computational effectiveness of our proposed framework.
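To make the near-optimality claim concrete, a short check using only the quantities defined above: when $\bar C - \bar C_b = \Omega(H)$, the upper bound simplifies as
\[
  \widetilde{\mathcal{O}}\!\left((\bar C - \bar C_b)^{-1} H^{2.5} S \sqrt{AK}\right)
  = \widetilde{\mathcal{O}}\!\left(H^{1.5} S \sqrt{AK}\right),
\]
which matches the lower bound $\Omega(H^{1.5}\sqrt{SAK})$ up to a factor of $\sqrt{S}$ and logarithmic terms.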
