Policy Gradient Methods for Risk-Sensitive Distributional Reinforcement Learning with Provable Convergence

Risk-sensitive reinforcement learning (RL) is crucial for maintaining reliable performance in high-stakes applications. While traditional RL methods aim to learn a point estimate of the random cumulative cost, distributional RL (DRL) seeks to estimate the entire distribution of it, which leads to a unified framework for handling different risk measures. However, developing policy gradient methods for risk-sensitive DRL is inherently more complex as it involves finding the gradient of a probability measure. This paper introduces a new policy gradient method for risk-sensitive DRL with general coherent risk measures, where we provide an analytical form of the probability measure's gradient for any distribution. For practical use, we design a categorical distributional policy gradient algorithm (CDPG) that approximates any distribution by a categorical family supported on some fixed points. We further provide a finite-support optimality guarantee and a finite-iteration convergence guarantee under inexact policy evaluation and gradient estimation. Through experiments on stochastic Cliffwalk and CartPole environments, we illustrate the benefits of considering a risk-sensitive setting in DRL.
View on arXiv