Online Learning in MDPs with Partially Adversarial Transitions and Losses

Ofir Schlisselberg
Tal Lancewicki
Yishay Mansour
Main: 13 Pages
Bibliography: 2 Pages
Appendix: 48 Pages
Abstract

We study reinforcement learning in MDPs whose transition function is stochastic at most steps but may behave adversarially at a fixed subset of $\Lambda$ steps per episode. This model captures environments that are stable except at a few vulnerable points. We introduce \emph{conditioned occupancy measures}, which remain stable across episodes even under adversarial transitions, and use them to design two algorithms. The first handles arbitrary adversarial steps and achieves regret $\tilde{O}(H S^{\Lambda}\sqrt{K S A^{\Lambda+1}})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $H$ is the episode's horizon. The second, assuming the adversarial steps are consecutive, improves the dependence on $S$ to $\tilde{O}(H\sqrt{K S^{3} A^{\Lambda+1}})$. We further give a $K^{2/3}$-regret reduction that removes the need to know which $\Lambda$ steps are adversarial. We also characterize the regret of adversarial MDPs in the \emph{fully adversarial} setting ($\Lambda = H-1$), under both full-information and bandit feedback, providing almost matching upper and lower bounds (slightly strengthening existing lower bounds and clarifying how different feedback structures affect the hardness of learning).
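To see where the improvement for consecutive adversarial steps comes from, a quick rewriting of the two bounds above may help (our algebra, using only the exponents stated in the abstract):

$$H S^{\Lambda}\sqrt{K S A^{\Lambda+1}} = H S^{\Lambda+\frac{1}{2}}\sqrt{K A^{\Lambda+1}}, \qquad H\sqrt{K S^{3} A^{\Lambda+1}} = H S^{\frac{3}{2}}\sqrt{K A^{\Lambda+1}}.$$

The two bounds therefore coincide at $\Lambda = 1$, and for $\Lambda \ge 2$ the second algorithm improves the $S$-dependence by a factor of $S^{\Lambda-1}$.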
