
On the Global Convergence of Risk-Averse Natural Policy Gradient Methods with Expected Conditional Risk Measures

International Conference on Machine Learning (ICML), 2023
Main: 17 pages · 2 figures · Bibliography: 2 pages · Appendix: 13 pages
Abstract

Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While policy gradient methods have been shown to find globally optimal policies in the risk-neutral setting, it remains unclear whether their risk-averse variants enjoy the same global convergence guarantees. In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive natural policy gradient (NPG) updates for ECRM-based RL problems. We establish global optimality and iteration complexity guarantees for the proposed risk-averse NPG algorithm with softmax parameterization and entropy regularization, under both exact and inexact policy evaluation. Furthermore, we test our risk-averse NPG algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our method.
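To make the setting concrete, the sketch below illustrates the standard risk-neutral, entropy-regularized NPG update under softmax parameterization on a toy tabular MDP, i.e., pi_{t+1}(a|s) ∝ pi_t(a|s)^{1 - eta*tau} * exp(eta * Q_t(s,a)). This is not the authors' code: the paper's risk-averse variant would replace the soft Q-values with risk-adjusted ones derived from the ECRM objective, and all names and constants here are illustrative assumptions.

import numpy as np

# Minimal sketch (risk-neutral stand-in, not the paper's algorithm):
# tabular entropy-regularized NPG with softmax parameterization.
n_states, n_actions = 4, 2
gamma, tau, eta = 0.9, 0.1, 0.5          # discount, entropy weight, step size
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(size=(n_states, n_actions))                       # rewards r(s, a)

pi = np.full((n_states, n_actions), 1.0 / n_actions)               # uniform init

def soft_q(pi, n_iters=200):
    """Evaluate the entropy-regularized Q-function of pi by fixed-point iteration."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = np.einsum("sa,sa->s", pi, Q - tau * np.log(pi + 1e-12))
        Q = R + gamma * P @ V
    return Q

for _ in range(100):
    Q = soft_q(pi)
    # Entropy-regularized NPG / softmax update:
    # pi_{t+1}(a|s) proportional to pi_t(a|s)^(1 - eta*tau) * exp(eta * Q_t(s, a))
    logits = (1.0 - eta * tau) * np.log(pi + 1e-12) + eta * Q
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

In this risk-neutral form the update contracts toward the entropy-regularized optimum at a rate governed by eta and tau; the paper studies the analogous guarantees when Q is replaced by an ECRM-based risk-adjusted value, under both exact and inexact policy evaluation.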
