Learning to Optimize under Non-Stationarity
We introduce algorithms that achieve state-of-the-art \emph{dynamic regret} bounds for the non-stationary linear stochastic bandit setting. This setting captures natural applications such as dynamic pricing and ad allocation in a changing environment. We show how the difficulty posed by the (possibly adversarial) non-stationarity can be overcome by a novel marriage between stochastic and adversarial bandit learning algorithms. Defining $d$, $B_T$, and $T$ as the problem dimension, the \emph{variation budget}, and the total time horizon, respectively, our main contributions are the tuned Sliding Window UCB (\texttt{SW-UCB}) algorithm with optimal $\widetilde{O}(d^{2/3}(B_T+1)^{1/3}T^{2/3})$ dynamic regret, and the tuning-free bandits-over-bandits (\texttt{BOB}) framework, built on top of the \texttt{SW-UCB} algorithm, with best $\widetilde{O}(d^{2/3}(B_T+1)^{1/4}T^{3/4})$ dynamic regret.
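To make the sliding-window idea concrete, here is a minimal Python sketch of one decision step of a sliding-window UCB rule for a linear bandit with a finite action set. It is an illustration under stated assumptions, not the paper's exact algorithm: the function name `sw_ucb_choose` and the parameters `window`, `lam`, and `beta` are hypothetical, and the paper's tuned window length and confidence radius are not reproduced here.

```python
import numpy as np

def sw_ucb_choose(actions, past_x, past_y, window, lam=1.0, beta=1.0):
    """Choose an action index using only the most recent `window` observations.

    actions: (n, d) array of candidate feature vectors.
    past_x, past_y: lists of played feature vectors and observed rewards.
    window (>= 1), lam, beta: illustrative parameters (assumed, not from the paper).
    """
    d = actions.shape[1]
    # Forget everything older than the sliding window.
    X = np.asarray(past_x[-window:], dtype=float).reshape(-1, d)
    y = np.asarray(past_y[-window:], dtype=float)
    # Regularized least-squares estimate on the windowed data.
    V = lam * np.eye(d) + X.T @ X
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ (X.T @ y)
    # Exploration bonus: sqrt(x^T V^{-1} x) for each candidate action x.
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    # Optimistic choice: estimated reward plus scaled bonus.
    return int(np.argmax(actions @ theta_hat + beta * bonus)), theta_hat
```

Because the estimate is built only from the last `window` rounds, data generated under an outdated parameter is eventually discarded. A shorter window adapts faster to change at the cost of a noisier estimate, and this trade-off is what tuning the window against the variation budget is meant to balance.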