
Discounted Thompson Sampling for Non-Stationary Bandit Problems

Abstract

Non-stationary multi-armed bandit (NS-MAB) problems have recently received significant attention. NS-MAB problems are typically modelled under two scenarios: abruptly changing, where reward distributions remain constant for a certain period and change at unknown time steps, and smoothly changing, where reward distributions evolve smoothly according to unknown dynamics. In this paper, we propose Discounted Thompson Sampling (DS-TS) with Gaussian priors to address both non-stationary settings. Our algorithm passively adapts to changes by incorporating a discount factor into Thompson Sampling. The DS-TS method has been experimentally validated, but an analysis of its regret upper bound has so far been lacking. Under mild assumptions, we show that DS-TS with Gaussian priors achieves a nearly optimal regret bound on the order of $\tilde{O}(\sqrt{TB_T})$ for abruptly changing environments and $\tilde{O}(T^{\beta})$ for smoothly changing environments, where $T$ is the number of time steps, $B_T$ is the number of breakpoints, $\beta$ is a parameter of the smoothly changing environment, and $\tilde{O}$ hides factors independent of $T$ as well as logarithmic terms. Furthermore, empirical comparisons between DS-TS and other non-stationary bandit algorithms demonstrate its competitive performance. In particular, when prior knowledge of the maximum expected reward is available, DS-TS can outperform state-of-the-art algorithms.
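The abstract only sketches the mechanism (a discount factor applied to the Thompson Sampling statistics, with Gaussian posteriors). The following is a minimal Python sketch of that idea, not the paper's exact algorithm: the exponential discounting of reward sums and pull counts, the posterior variance form $\sigma^2$ divided by the discounted count, the toy abruptly changing Bernoulli environment, and the parameter values (gamma, sigma, breakpoints) are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Toy abruptly changing environment: Bernoulli arms whose means are
# permuted at fixed breakpoints (means and breakpoints are illustrative).
true_means = np.array([0.2, 0.5, 0.8])
breakpoints = {3000, 6000}

def pull(arm, means):
    return float(rng.random() < means[arm])

def ds_ts(horizon=9000, gamma=0.99, sigma=1.0):
    """Sketch of Thompson Sampling with exponentially discounted statistics
    and Gaussian posteriors; gamma = 1 recovers standard Gaussian-prior TS."""
    n_arms = len(true_means)
    disc_sum = np.zeros(n_arms)   # discounted reward sums
    disc_cnt = np.zeros(n_arms)   # discounted pull counts
    means = true_means.copy()
    total = 0.0
    for t in range(horizon):
        if t in breakpoints:      # abrupt change: permute the arm means
            means = rng.permutation(means)
        # Posterior sample per arm: N(discounted mean, sigma^2 / discounted count).
        post_mean = np.where(disc_cnt > 0, disc_sum / np.maximum(disc_cnt, 1e-12), 0.0)
        post_std = sigma / np.sqrt(np.maximum(disc_cnt, 1e-12))
        samples = np.where(disc_cnt > 0,
                           rng.normal(post_mean, post_std),
                           rng.normal(0.0, 1e3, size=n_arms))  # force exploration of unpulled arms
        arm = int(np.argmax(samples))
        r = pull(arm, means)
        total += r
        # Discount every arm's statistics, then credit the pulled arm.
        disc_sum *= gamma
        disc_cnt *= gamma
        disc_sum[arm] += r
        disc_cnt[arm] += 1.0
    return total

print(ds_ts())

The discounting makes the effective sample size of each arm decay when it is not pulled, so the posterior variance grows again and the sampler re-explores arms whose reward distributions may have shifted; with gamma close to 1 the behaviour approaches stationary Thompson Sampling.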
