Non-stationary Linear Bandits Revisited

International Conference on Artificial Intelligence and Statistics (AISTATS), 2020
Abstract

In this note, we revisit non-stationary linear bandits, a variant of stochastic linear bandits with a time-varying underlying regression parameter. Existing studies develop various algorithms and show that they enjoy an $\widetilde{O}(T^{2/3}(1+P_T)^{1/3})$ dynamic regret, where $T$ is the time horizon and $P_T$ is the path-length that measures the fluctuation of the evolving unknown parameter. However, we discover that a serious technical flaw makes this argument ungrounded. We revisit the analysis and present a fix. Without modifying the original algorithms, we prove an $\widetilde{O}(T^{3/4}(1+P_T)^{1/4})$ dynamic regret for these algorithms, slightly worse than the anticipated rate. We also show some impossibility results for the key quantity concerned in the regret analysis. Note that the above dynamic regret guarantee requires oracle knowledge of the path-length $P_T$. Combining the bandit-over-bandit mechanism, we can also achieve the same guarantee in a parameter-free way.
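As a quick numerical illustration (not part of the paper), the gap between the originally claimed rate $T^{2/3}(1+P_T)^{1/3}$ and the corrected rate $T^{3/4}(1+P_T)^{1/4}$ can be sketched by evaluating both orders directly, ignoring constants and logarithmic factors; the choice of $P_T$ below is an arbitrary example:

```python
def claimed_rate(T, P_T):
    """Originally claimed dynamic regret order: T^(2/3) * (1 + P_T)^(1/3)."""
    return T ** (2 / 3) * (1 + P_T) ** (1 / 3)

def corrected_rate(T, P_T):
    """Corrected dynamic regret order: T^(3/4) * (1 + P_T)^(1/4)."""
    return T ** (3 / 4) * (1 + P_T) ** (1 / 4)

for T in (10**4, 10**6):
    P_T = T ** 0.5  # an illustrative path-length growing sublinearly in T
    print(T, claimed_rate(T, P_T), corrected_rate(T, P_T))
```

For large $T$ the corrected bound exceeds the claimed one, yet both remain sublinear in $T$, so the fixed analysis still certifies vanishing average dynamic regret.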
