Slowly Changing Adversarial Bandit Algorithms are Efficient for
Discounted MDPs
International Conference on Algorithmic Learning Theory (ALT), 2022
Abstract
Reinforcement learning generalizes bandit problems with the additional difficulties of a longer planning horizon and an unknown transition kernel. We show that, under some mild assumptions, *any* slowly changing adversarial bandit algorithm that enjoys optimal regret in adversarial bandits can achieve optimal (in the dependence on the number of interactions) expected regret in infinite-horizon discounted MDPs, without the presence of Bellman backups. The slowly changing property required by our generalization is mild and is also noted in the online Markov decision process literature. We also examine the applicability of our reduction to a well-known adversarial bandit algorithm, EXP3.
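The paper's reduction is not reproduced here, but the slowly changing property can be seen in standard EXP3 itself: each round, only the pulled arm's weight is updated by an importance-weighted exponential factor, so the sampling distribution drifts gradually between rounds. Below is a minimal sketch of textbook EXP3 (exponential weights with uniform exploration); the names `reward_fn` and the exploration rate `gamma` are illustrative choices, not from the paper.

```python
import math
import random

def exp3(num_arms, horizon, reward_fn, gamma=0.1):
    """Textbook EXP3 sketch: exponential weights mixed with uniform exploration.

    reward_fn(arm, t) should return a reward in [0, 1]; gamma is the
    exploration rate (illustrative default, not tuned).
    """
    weights = [1.0] * num_arms
    history = []
    for t in range(horizon):
        total = sum(weights)
        # Mix the weight distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(arm, t)
        # Importance-weighted reward estimate; only the pulled arm's weight
        # moves, so the distribution changes slowly from round to round.
        est = reward / probs[arm]
        weights[arm] *= math.exp(gamma * est / num_arms)
        history.append((arm, reward))
    return history
```

For example, with two arms where arm 0 always pays 1 and arm 1 pays 0, the distribution concentrates on arm 0 while the per-round change in `probs` stays small.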
