Tracking Most Significant Arm Switches in Bandits

Abstract

In bandits with distribution shifts, one aims to automatically adapt to unknown changes in the reward distributions, and to restart exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few of the changes in distribution are actually severe. To resolve this question, we propose a new notion of significant shift, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best-arm switches, but also large aggregate differences in reward over time. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly so) than $O(\sqrt{ST})$, where $S \ll L$ only counts best-arm switches, while at the same time always faster than the optimal $O(V^{1/3}T^{2/3})$ when expressed in terms of the total variation $V$ (which aggregates differences over time). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.
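As a rough numerical illustration of how the three regret rates in the abstract compare, the sketch below tabulates $\sqrt{LT}$, $\sqrt{ST}$, and $V^{1/3}T^{2/3}$ for arbitrary example values of the horizon $T$, the change counts $L$ and $S$, and the total variation $V$; these numbers are purely hypothetical and are not taken from the paper.

```python
# Illustrative comparison of the regret rates discussed in the abstract.
# All parameter values below are arbitrary examples, not from the paper:
#   T = horizon, L = total number of distribution changes,
#   S = number of best-arm switches (S << L), V = total variation.
import math

T = 1_000_000   # horizon (example value)
L = 100         # total distribution changes (example value)
S = 5           # best-arm switches, S << L (example value)
V = 2.0         # total variation of the rewards (example value)

rate_L = math.sqrt(L * T)           # worst-case optimal sqrt(LT) of Auer et al.
rate_S = math.sqrt(S * T)           # sqrt(ST): counts only best-arm switches
rate_V = V ** (1 / 3) * T ** (2 / 3)  # optimal V^{1/3} T^{2/3} in terms of V

print(f"sqrt(LT)        = {rate_L:,.0f}")
print(f"sqrt(ST)        = {rate_S:,.0f}")
print(f"V^(1/3) T^(2/3) = {rate_V:,.0f}")
# Per the abstract, the proposed procedure adaptively achieves rates at
# least as fast as both sqrt(ST) and V^{1/3} T^{2/3}, without prior
# knowledge of L, S, or V.
```

For these example values the script prints roughly 10,000 for $\sqrt{LT}$, 2,236 for $\sqrt{ST}$, and 12,599 for $V^{1/3}T^{2/3}$, showing how counting only best-arm switches can yield a much smaller bound than the worst-case rate.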
