Slowly Changing Adversarial Bandit Algorithms are Efficient for
Discounted MDPs
International Conference on Algorithmic Learning Theory (ALT), 2022
Abstract
Reinforcement learning generalizes bandit problems with the additional difficulties of a longer planning horizon and an unknown transition kernel. We show that, under some mild assumptions, *any* slowly changing adversarial bandit algorithm that enjoys optimal regret in adversarial bandits can achieve optimal (in the dependence on the number of interactions) expected regret in infinite-horizon discounted MDPs, without the presence of Bellman backups. The slowly changing property required by our generalization is mild and is also noted in the online Markov decision process literature. We also examine the applicability of our reduction to a well-known adversarial bandit algorithm, EXP3.
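The paper's reduction is not reproduced here, but the slowly changing property can be seen in standard EXP3 itself: each round, only the pulled arm's weight is updated by an importance-weighted exponential factor, so the sampling distribution drifts gradually between rounds. Below is a minimal sketch of textbook EXP3 (exponential weights with uniform exploration); the names `reward_fn` and the exploration rate `gamma` are illustrative choices, not from the paper.

```python
import math
import random

def exp3(num_arms, horizon, reward_fn, gamma=0.1):
    """Textbook EXP3 sketch: exponential weights mixed with uniform exploration.

    reward_fn(arm, t) should return a reward in [0, 1]; gamma is the
    exploration rate (illustrative default, not tuned).
    """
    weights = [1.0] * num_arms
    history = []
    for t in range(horizon):
        total = sum(weights)
        # Mix the weight distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(arm, t)
        # Importance-weighted reward estimate; only the pulled arm's weight
        # moves, so the distribution changes slowly from round to round.
        est = reward / probs[arm]
        weights[arm] *= math.exp(gamma * est / num_arms)
        history.append((arm, reward))
    return history
```

For example, with two arms where arm 0 always pays 1 and arm 1 pays 0, the distribution concentrates on arm 0 while the per-round change in `probs` stays small.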
