
Learning to Optimize under Non-Stationarity

Abstract

We introduce algorithms that achieve state-of-the-art \emph{dynamic regret} bounds for the non-stationary linear stochastic bandit setting. This setting captures natural applications such as advertisement allocation and dynamic pricing in a changing environment. We show how the difficulty posed by the (possibly adversarial) non-stationarity can be overcome by a novel marriage between stochastic and adversarial bandit learning algorithms. Defining $d$, $B_T$, and $T$ as the problem dimension, the variation budget, and the total time horizon, respectively, our main contributions are the tuned Sliding Window Upper-Confidence-Bound algorithm with optimal $\widetilde{O}\left(d^{2/3}B_T^{1/3}T^{2/3}\right)$ dynamic regret, and the tuning-free bandits-over-bandits framework built on top of the Sliding Window Upper-Confidence-Bound algorithm, which (surprisingly) recovers the optimal $\widetilde{O}\left(d^{2/3}B_T^{1/3}T^{2/3}\right)$ dynamic regret when the amount of non-stationarity is moderate to large, i.e., $B_T\geq d^{-1/2}T^{1/4}$, while attaining improved (compared to existing literature) $\widetilde{O}\left(d^{1/2}T^{3/4}\right)$ dynamic regret otherwise. We further conduct extensive numerical experiments to show that our proposed algorithms achieve superior dynamic regret performance.
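The forgetting mechanism behind a sliding-window UCB rule can be sketched as follows: estimate the reward parameter by regularized least squares over only the most recent $w$ observations, then pick the arm maximizing the estimate plus an exploration bonus. This is a minimal illustration, not the paper's implementation; the window length `w`, bonus scale `beta`, and function name are illustrative assumptions.

```python
import numpy as np

def sw_ucb_step(history, arms, w, lam=1.0, beta=1.0):
    """One step of a sliding-window UCB rule for linear bandits.

    history: list of (x, r) pairs (feature vector, observed reward).
    Only the most recent `w` pairs enter the estimate, so stale data
    from a drifted environment is forgotten.
    arms: (K, d) array of candidate arm feature vectors.
    """
    recent = history[-w:]
    d = arms.shape[1]
    V = lam * np.eye(d)                 # regularized design matrix
    b = np.zeros(d)
    for x, r in recent:
        V += np.outer(x, x)
        b += r * x
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b               # windowed least-squares estimate
    # optimistic index: predicted reward plus width of confidence ellipsoid
    widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
    ucb = arms @ theta_hat + beta * widths
    return int(np.argmax(ucb))
```

Because only the last `w` rounds are used, the estimator tracks a slowly moving parameter at the cost of higher variance; the paper's tuned variant chooses the window as a function of $d$, $B_T$, and $T$, while the bandits-over-bandits framework learns a good window online.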
