Learning to Optimize under Non-Stationarity
We introduce algorithms that achieve state-of-the-art \emph{dynamic regret} bounds for the non-stationary linear stochastic bandit setting. This setting captures natural applications such as advertisement allocation and dynamic pricing in a changing environment. We show how the difficulty posed by the (possibly adversarial) non-stationarity can be overcome by a novel marriage between stochastic and adversarial bandit learning algorithms. Defining $d$, $B_T$, and $T$ as the problem dimension, the variation budget, and the total time horizon, respectively, our main contributions are the tuned Sliding Window Upper-Confidence-Bound (SW-UCB) algorithm with optimal dynamic regret, and the tuning-free bandits-over-bandits framework built on top of SW-UCB, which (surprisingly) recovers the optimal dynamic regret when the amount of non-stationarity is moderate to large, while attaining improved dynamic regret (compared to the existing literature) otherwise. We further conduct extensive numerical experiments showing that our proposed algorithms achieve superior dynamic regret performance.
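The core idea behind the sliding-window approach can be illustrated with a minimal sketch: at each round, the learner fits a ridge-regression estimate of the unknown (drifting) parameter using only the most recent $w$ observations, then plays the arm maximizing an optimistic upper confidence bound. This is an illustrative implementation, not the paper's exact algorithm; the window size `w`, regularizer `lam`, and confidence width `beta` are hypothetical inputs (the paper derives tuned choices from $d$, $B_T$, and $T$).

```python
import numpy as np

def sw_ucb_linear(arms, reward_fn, T, w, lam=1.0, beta=1.0):
    """Sliding-window UCB for linear bandits (illustrative sketch).

    arms      : (K, d) array of feature vectors, one per arm
    reward_fn : callable (t, x) -> float, the (non-stationary) reward oracle
    T         : time horizon; w : sliding-window length
    """
    d = arms.shape[1]
    history = []   # (x, r) pairs; only the last w are used each round
    rewards = []
    for t in range(T):
        # Ridge regression restricted to the sliding window, so that
        # stale data from a drifted environment is discarded.
        V = lam * np.eye(d)
        b = np.zeros(d)
        for x, r in history[-w:]:
            V += np.outer(x, x)
            b += r * x
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b
        # Optimism: point estimate plus an exploration bonus ||x||_{V^{-1}}.
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
        x = arms[int(np.argmax(arms @ theta_hat + beta * bonus))]
        r = reward_fn(t, x)
        history.append((x, r))
        rewards.append(r)
    return np.array(rewards)
```

For example, one can simulate an environment whose parameter switches halfway through the horizon; with a window shorter than the stationary segments, the estimator tracks the change instead of averaging over both regimes.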