Learning to Optimize under Non-Stationarity
We introduce algorithms that achieve state-of-the-art \emph{dynamic regret} bounds for the non-stationary linear stochastic bandit setting. This setting captures natural applications such as advertisement allocation and dynamic pricing in a changing environment. We show how the difficulty posed by the (possibly adversarial) non-stationarity can be overcome by a novel marriage between stochastic and adversarial bandit learning algorithms. Defining $d$, $B_T$, and $T$ as the problem dimension, the variation budget, and the total time horizon, respectively, our main contributions are the tuned Sliding Window Upper-Confidence-Bound (SW-UCB) algorithm with optimal dynamic regret, and the tuning-free bandits-over-bandits framework built on top of SW-UCB, which (surprisingly) recovers the optimal dynamic regret when the amount of non-stationarity is moderate to large, while attaining improved dynamic regret (compared to the existing literature) otherwise. We further conduct extensive numerical experiments showing that our proposed algorithms achieve superior dynamic regret performance.
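The core idea behind the sliding-window approach can be illustrated with a minimal sketch: at each round, the learner fits a ridge-regression estimate of the unknown (drifting) parameter using only the most recent $w$ observations, then plays the arm maximizing an optimistic upper confidence bound. This is an illustrative implementation, not the paper's exact algorithm; the window size `w`, regularizer `lam`, and confidence width `beta` are hypothetical inputs (the paper derives tuned choices from $d$, $B_T$, and $T$).

```python
import numpy as np

def sw_ucb_linear(arms, reward_fn, T, w, lam=1.0, beta=1.0):
    """Sliding-window UCB for linear bandits (illustrative sketch).

    arms      : (K, d) array of feature vectors, one per arm
    reward_fn : callable (t, x) -> float, the (non-stationary) reward oracle
    T         : time horizon; w : sliding-window length
    """
    d = arms.shape[1]
    history = []   # (x, r) pairs; only the last w are used each round
    rewards = []
    for t in range(T):
        # Ridge regression restricted to the sliding window, so that
        # stale data from a drifted environment is discarded.
        V = lam * np.eye(d)
        b = np.zeros(d)
        for x, r in history[-w:]:
            V += np.outer(x, x)
            b += r * x
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b
        # Optimism: point estimate plus an exploration bonus ||x||_{V^{-1}}.
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
        x = arms[int(np.argmax(arms @ theta_hat + beta * bonus))]
        r = reward_fn(t, x)
        history.append((x, r))
        rewards.append(r)
    return np.array(rewards)
```

For example, one can simulate an environment whose parameter switches halfway through the horizon; with a window shorter than the stationary segments, the estimator tracks the change instead of averaging over both regimes.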