
Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Abstract

In this work, we study the sample complexity problem of risk-sensitive Reinforcement Learning (RL) with a generative model, where we aim to maximize the Conditional Value at Risk (CVaR) with risk tolerance level $\tau$ at each step, a criterion we refer to as Iterated CVaR. We first build a connection between Iterated CVaR RL and $(s, a)$-rectangular distributional robust RL with a specific uncertainty set for CVaR. We establish nearly matching upper and lower bounds on the sample complexity of this problem. Specifically, we first prove that a value iteration-based algorithm, ICVaR-VI, achieves an $\epsilon$-optimal policy with at most $\tilde{O}\left(\frac{SA}{(1-\gamma)^4\tau^2\epsilon^2}\right)$ samples, where $\gamma$ is the discount factor, and $S, A$ are the sizes of the state and action spaces. Furthermore, when $\tau \geq \gamma$, the sample complexity improves to $\tilde{O}\left(\frac{SA}{(1-\gamma)^3\epsilon^2}\right)$. We further show a minimax lower bound of $\tilde{O}\left(\frac{(1-\gamma\tau)SA}{(1-\gamma)^4\tau\epsilon^2}\right)$. For a fixed risk level $\tau \in (0,1]$, our upper and lower bounds match, demonstrating the tightness and optimality of our analysis. We also investigate a limiting case with a small risk level $\tau$, called Worst-Path RL, where the objective is to maximize the minimum possible cumulative reward. We develop matching upper and lower bounds of $\tilde{O}\left(\frac{SA}{p_{\min}}\right)$, where $p_{\min}$ denotes the minimum non-zero reaching probability of the transition kernel.
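The abstract centers on a value-iteration algorithm for the Iterated CVaR objective. As a rough illustration of the underlying Bellman update, the Python sketch below applies the CVaR operator at level $\tau$ to the next-state value distribution in every backup; it is a minimal, generic tabular sketch under assumed inputs (transition kernel P, reward table R, discount gamma, risk level tau), not the authors' ICVaR-VI algorithm or its sample-based analysis, and the function names are hypothetical.

import numpy as np

def cvar_lower_tail(values, probs, tau):
    """CVaR at level tau of a discrete distribution: the mean of the worst
    tau-probability mass of `values` (lower tail). Assumes 0 < tau <= 1."""
    order = np.argsort(values)            # ascending: worst outcomes first
    v, p = values[order], probs[order]
    cum = np.cumsum(p)
    # probability mass each outcome contributes to the tau-tail
    w = np.clip(np.minimum(cum, tau) - (cum - p), 0.0, None)
    return float(w @ v) / tau

def icvar_value_iteration(P, R, gamma, tau, tol=1e-8, max_iter=10_000):
    """Sketch of value iteration for the Iterated CVaR criterion:
    V(s) = max_a [ R(s, a) + gamma * CVaR_tau_{s' ~ P(.|s, a)}[V(s')] ].
    P: (S, A, S) transition kernel, R: (S, A) reward table."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                Q[s, a] = R[s, a] + gamma * cvar_lower_tail(V, P[s, a], tau)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)

As tau shrinks toward zero, the CVaR operator in this sketch approaches the minimum value over reachable next states, which is consistent with the Worst-Path RL limit discussed in the abstract.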

@article{deng2025_2503.08934,
  title={Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model},
  author={Zilong Deng and Simon Khan and Shaofeng Zou},
  journal={arXiv preprint arXiv:2503.08934},
  year={2025}
}