
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Journal of Machine Learning Research (JMLR), 2022
Abstract

We study regret guarantees for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of the return. We identify a key property of the EntRM, the monotonicity-preserving property, which enables a risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, one model-free and one model-based. We prove that both attain an $\tilde{\mathcal{O}}\big(\frac{\exp(|\beta| H)-1}{|\beta| H}H\sqrt{HS^2AT}\big)$ regret upper bound, where $S$ is the number of states, $A$ the number of actions, $H$ the time horizon and $T$ the total number of time steps. This matches the bound of RSVI2 proposed in \cite{fei2021exponential}, with a much simpler regret analysis. To the best of our knowledge, this is the first regret analysis of DRL, which bridges DRL and RSRL in terms of sample complexity. Finally, we improve the existing lower bound by proving a tighter bound of $\Omega\big(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT}\big)$ for the case $\beta>0$, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
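For concreteness, the entropic risk measure of a random return $X$ is standardly defined as $\mathrm{EntRM}_\beta(X)=\frac{1}{\beta}\log\mathbb{E}[\exp(\beta X)]$, which interpolates between risk-seeking ($\beta>0$), risk-neutral ($\beta\to 0$), and risk-averse ($\beta<0$) preferences. The following is a minimal illustrative Python sketch, not code from the paper; the `entrm` helper is hypothetical and assumes a discrete return distribution, evaluated with a numerically stable log-sum-exp:

```python
import numpy as np

def entrm(returns, probs, beta):
    """Entropic risk measure of a discrete return distribution.

    EntRM_beta(X) = (1/beta) * log E[exp(beta * X)].
    Hypothetical helper for illustration; beta > 0 is risk-seeking,
    beta < 0 risk-averse, and beta -> 0 recovers E[X].
    """
    returns = np.asarray(returns, dtype=float)
    probs = np.asarray(probs, dtype=float)
    if beta == 0.0:
        return float(probs @ returns)  # risk-neutral limit
    # log E[exp(beta X)] computed via log-sum-exp for stability
    z = beta * returns + np.log(probs)
    m = np.max(z)
    return float((m + np.log(np.sum(np.exp(z - m)))) / beta)

# A return of 0 or 1 with equal probability:
print(entrm([0.0, 1.0], [0.5, 0.5], beta=1.0))   # ~0.62, above the mean
print(entrm([0.0, 1.0], [0.5, 0.5], beta=-1.0))  # ~0.38, below the mean
print(entrm([0.0, 1.0], [0.5, 0.5], beta=0.0))   # 0.5, the expectation
```

Because $\exp(\beta X)$ can overflow for returns accumulated over a horizon of length $H$, the log-sum-exp trick matters in practice; the $\frac{\exp(|\beta| H)-1}{|\beta| H}$ factor in the regret bound reflects the same exponential sensitivity of the objective to $|\beta| H$.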
