MOTS: Minimax Optimal Thompson Sampling

International Conference on Machine Learning (ICML), 2020
Abstract

Thompson sampling is one of the most widely used algorithms for many online decision problems, due to its simplicity of implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can achieve the minimax optimal regret O(\sqrt{KT}) for K-armed bandit problems, where T is the total time horizon. In this paper, we solve this long-standing open problem by proposing a new Thompson sampling algorithm called MOTS that adaptively truncates the sampling result of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound O(\sqrt{KT}) for finite time horizon T, as well as the asymptotically optimal regret bound as T grows to infinity. This is the first time that minimax optimality for multi-armed bandit problems has been attained by a Thompson sampling type algorithm.
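To make the key idea concrete, the following is a minimal sketch of Gaussian Thompson sampling in which each arm's posterior sample is clipped from above at a UCB-style threshold, in the spirit of the adaptive truncation described in the abstract. The specific sampling distribution, the threshold formula, and the parameter alpha are illustrative assumptions here, not the exact expressions derived in the paper.

```python
import math
import random

def truncated_ts_sketch(means, T, alpha=4.0, seed=0):
    """Sketch of Thompson sampling with adaptive truncation (MOTS-style).

    means: true mean reward of each arm (used only to simulate rewards).
    T:     total time horizon.
    Returns the number of pulls of each arm after T rounds.
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # pulls per arm
    sums = [0.0] * K      # cumulative reward per arm

    # Initialization: pull each arm once.
    for i in range(K):
        counts[i] += 1
        sums[i] += rng.gauss(means[i], 1.0)

    for _ in range(K, T):
        samples = []
        for i in range(K):
            mu_hat = sums[i] / counts[i]
            # Gaussian sample around the empirical mean (assumed form).
            theta = rng.gauss(mu_hat, math.sqrt(alpha / counts[i]))
            # Adaptive truncation: clip the sample at a confidence-bound
            # style threshold (hypothetical form for illustration only).
            log_plus = max(0.0, math.log(T / (K * counts[i])))
            tau = mu_hat + math.sqrt((alpha / counts[i]) * log_plus)
            samples.append(min(theta, tau))
        # Play the arm with the largest (truncated) sample.
        arm = max(range(K), key=lambda i: samples[i])
        counts[arm] += 1
        sums[arm] += rng.gauss(means[arm], 1.0)

    return counts
```

The clipping step is what distinguishes this sketch from vanilla Thompson sampling: it prevents over-optimistic samples for rarely pulled arms from inflating the worst-case regret, which is the intuition behind the minimax guarantee claimed in the paper.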
