
Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Abstract

We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting; in fact it achieves the same regret bound as (Rosenberg & Mansour, 2019a) that considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an \textit{upper occupancy bound}.
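
To make the second contribution concrete, the following is a minimal sketch of an inverse-weighted loss estimate under bandit feedback. It is an illustration, not the paper's algorithm: the abstract states only that the estimator is inversely weighted by an upper occupancy bound. The function name `optimistic_loss_estimate`, the dictionary layout, and the bias parameter `gamma` are assumptions, and the upper occupancy bound is taken as given rather than computed from a confidence set. Dividing by an upper bound on the probability of visiting a state-action pair, rather than by the unknown true probability, makes the estimate underestimate the loss in expectation, which is the sense in which it is optimistic.

```python
def optimistic_loss_estimate(trajectory, upper_occupancy, gamma=0.0):
    """Hypothetical helper: loss estimate for a single episode.

    trajectory      : list of (state, action, loss) triples observed in the
                      episode (bandit feedback: only visited pairs reveal a loss)
    upper_occupancy : dict mapping (state, action) to an assumed upper bound
                      on the probability of visiting that pair
    gamma           : assumed non-negative bias term added to the denominator
    """
    estimate = {}
    for state, action, loss in trajectory:
        # Inverse weighting by the upper bound instead of the true
        # (unknown) visiting probability.
        estimate[(state, action)] = loss / (upper_occupancy[(state, action)] + gamma)
    # Pairs that were not visited are implicitly estimated as 0.
    return estimate


# Usage: a two-step episode with illustrative numbers.
episode = [("x0", "a1", 0.5), ("x1", "a0", 0.25)]
bounds = {("x0", "a1"): 0.5, ("x1", "a0"): 0.25}
estimates = optimistic_loss_estimate(episode, bounds)
```

In this sketch a pair with a small upper bound receives a large estimate when it is visited, so a bias term such as `gamma` is one common way to keep the estimate bounded.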
