Non-asymptotic Performances of Robust Markov Decision Processes

Wenhao Yang
Abstract

In this paper, we study the non-asymptotic performance of the optimal robust policy, measured by the robust value function under the true transition dynamics. The optimal robust policy is computed from a generative model or an offline dataset, without access to the true transition dynamics. In particular, we consider three different uncertainty sets, namely the $L_1$, $\chi^2$, and KL balls, under both the $(s,a)$-rectangular and $s$-rectangular assumptions. Our results show that under the $(s,a)$-rectangular assumption on the uncertainty sets, the sample complexity is about $\widetilde{O}\left(\frac{|\mathcal{S}|^2|\mathcal{A}|}{\varepsilon^2\rho^2(1-\gamma)^4}\right)$ in the generative model setting and $\widetilde{O}\left(\frac{|\mathcal{S}|}{\nu_{\min}\varepsilon^2\rho^2(1-\gamma)^4}\right)$ in the offline dataset setting. While prior work on non-asymptotic performance is restricted to the KL ball and the $(s,a)$-rectangular assumption, we also extend our results to the more general $s$-rectangular assumption, which leads to a larger sample complexity than in the $(s,a)$-rectangular case.
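For context, here is a minimal sketch of the robust MDP formulation these bounds refer to. The abstract does not spell out the definitions, so the notation below (e.g., $P^o$ for the nominal transition kernel) follows common conventions in the robust RL literature and should be read as an assumption rather than the paper's exact formulation:

```latex
% Robust value function of a policy \pi: the worst-case discounted
% return over all transition kernels in the uncertainty set \mathcal{P}.
V^{\pi}_{\mathcal{P}}(s)
  = \inf_{P \in \mathcal{P}}
    \mathbb{E}_{P,\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)
      \,\middle|\, s_0 = s\right]

% (s,a)-rectangular uncertainty set: each state-action pair has its own
% independent ball of radius \rho around the nominal kernel P^o, where
% D is the L_1 distance, \chi^2 divergence, or KL divergence.
\mathcal{P} = \bigotimes_{(s,a) \in \mathcal{S} \times \mathcal{A}}
  \left\{ P(\cdot \mid s,a) \;:\;
    D\!\left(P(\cdot \mid s,a) \,\middle\|\, P^{o}(\cdot \mid s,a)\right)
    \le \rho \right\}

% s-rectangular relaxation (one common convention): the constraint
% couples all actions at each state, e.g. by bounding the divergence
% summed over actions, which makes the set strictly more general.
\mathcal{P} = \bigotimes_{s \in \mathcal{S}}
  \left\{ P(\cdot \mid s, \cdot) \;:\;
    \sum_{a \in \mathcal{A}}
      D\!\left(P(\cdot \mid s,a) \,\middle\|\, P^{o}(\cdot \mid s,a)\right)
    \le |\mathcal{A}|\,\rho \right\}
```

Under this reading, the non-asymptotic guarantees bound the gap $\max_s \left(V^{\pi^*}_{\mathcal{P}}(s) - V^{\hat{\pi}}_{\mathcal{P}}(s)\right)$, where $\hat{\pi}$ is the robust policy solved from the generative model or offline data and both value functions are evaluated under the true dynamics.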
