
Towards Minimax Optimality of Model-based Robust Reinforcement Learning

Abstract

We study the sample complexity of obtaining an $\epsilon$-optimal policy in \emph{Robust} discounted Markov Decision Processes (RMDPs), given only access to a generative model of the nominal kernel. This problem is widely studied in the non-robust case, where it is known that any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}\left(\frac{H^3 |S| |A|}{\epsilon^2}\right)$ samples provides an $\epsilon$-optimal policy, which is minimax optimal. Results in the robust case are much more scarce. For $sa$- (resp. $s$-)rectangular uncertainty sets, the best known sample complexity is $\tilde{\mathcal{O}}\left(\frac{H^4 |S|^2 |A|}{\epsilon^2}\right)$ (resp. $\tilde{\mathcal{O}}\left(\frac{H^4 |S|^2 |A|^2}{\epsilon^2}\right)$), for specific algorithms and when the uncertainty set is based on the total variation (TV), the KL or the chi-square divergence. In this paper, we consider uncertainty sets defined with an $L_p$-ball (recovering the TV case), and study the sample complexity of \emph{any} planning algorithm (with a high-accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model. In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}\left(\frac{H^4 |S| |A|}{\epsilon^2}\right)$ for both the $sa$- and $s$-rectangular cases (improvements by factors of $|S|$ and $|S| |A|$ respectively). When the size of the uncertainty set is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}\left(\frac{H^3 |S| |A|}{\epsilon^2}\right)$, matching for the first time the lower bound of the non-robust case, as well as a robust lower bound that holds when the uncertainty is small enough.
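The model-based recipe studied in the abstract (draw samples from the generative model to build an empirical nominal kernel, then run any sufficiently accurate planning algorithm on the resulting empirical RMDP) can be illustrated with a short sketch. The code below is a minimal, hypothetical example, not the paper's algorithm or its guarantees: it estimates the kernel from `n_samples` generative calls per state-action pair and plans with robust value iteration for an $sa$-rectangular TV ($L_1$) ball of radius `beta`, solving the inner worst-case expectation as a small linear program. The sampler `sample_next_state`, the reward table `R`, and all sizes are assumed placeholders.

```python
# Illustrative sketch (not the paper's method): empirical RMDP estimation from a
# generative model, followed by robust value iteration with an sa-rectangular
# TV-ball uncertainty set around the empirical nominal kernel.
import numpy as np
from scipy.optimize import linprog


def estimate_kernel(sample_next_state, n_states, n_actions, n_samples):
    """Empirical nominal kernel from n_samples generative-model calls per (s, a).
    `sample_next_state(s, a)` is a hypothetical sampler returning a next-state index."""
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    return P_hat / n_samples


def worst_case_expectation(p, v, beta):
    """min_q  q @ v  s.t.  q in the simplex and ||q - p||_1 <= 2*beta (TV radius beta).
    Solved as an LP in variables (q, t) with |q_i - p_i| <= t_i and sum_i t_i <= 2*beta."""
    n = len(p)
    c = np.concatenate([v, np.zeros(n)])
    A_ub = np.block([[np.eye(n), -np.eye(n)],      #  q_i - t_i <=  p_i
                     [-np.eye(n), -np.eye(n)],     # -q_i - t_i <= -p_i
                     [np.zeros((1, n)), np.ones((1, n))]])  # sum_i t_i <= 2*beta
    b_ub = np.concatenate([p, -p, [2.0 * beta]])
    A_eq = np.concatenate([np.ones((1, n)), np.zeros((1, n))], axis=1)  # sum_i q_i = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(0, None)] * n)
    return res.fun


def robust_value_iteration(P_hat, R, beta, gamma=0.99, n_iter=500):
    """Robust Bellman iteration on the empirical RMDP (sa-rectangular TV ball)."""
    n_states, n_actions, _ = P_hat.shape
    V = np.zeros(n_states)
    for _ in range(n_iter):
        Q = np.array([[R[s, a] + gamma * worst_case_expectation(P_hat[s, a], V, beta)
                       for a in range(n_actions)] for s in range(n_states)])
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V  # greedy policy and robust value on the empirical RMDP
```

Any planner with a high-accuracy guarantee on the empirical RMDP could replace the value-iteration loop; the LP inner step is used here only for clarity, since more efficient closed-form duals are available for TV and general $L_p$ balls.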
