
Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach

Abstract

Multi-armed bandit (MAB) is a class of online learning problems in which a learning agent aims to maximize its expected cumulative reward while repeatedly selecting arms to pull from a set with unknown reward distributions. In this paper, we consider a scenario in which the arms' reward distributions may change in a piecewise-stationary fashion at unknown time steps. By connecting change-detection techniques with classic UCB algorithms, we motivate and propose a learning algorithm called M-UCB for this scenario, which can detect and adapt to changes. We also establish an $O(\sqrt{MKT\log T})$ regret bound for M-UCB, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with state-of-the-art algorithms in a numerical experiment based on a public Yahoo! dataset. In this experiment, M-UCB achieves about a 50% regret reduction with respect to the best-performing state-of-the-art algorithm.
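The abstract describes M-UCB only at a high level. The following is a minimal illustrative sketch, not the paper's exact procedure, of how a change detector can be wired into UCB1: each arm keeps a sliding window of its most recent rewards, a change is declared when the means of the two window halves differ by more than a threshold, and a detection triggers a full restart of the bandit statistics. The class name `MUCBSketch`, the window length `w`, and the threshold `b` are placeholders chosen here for illustration; the paper's actual detection statistic, parameter choices, and exploration schedule may differ.

```python
import math
from collections import deque


class MUCBSketch:
    """Illustrative UCB1 agent with a sliding-window change detector.

    Per arm, the last `w` rewards are kept; a change is declared when the
    means of the two window halves differ by more than `b`, after which all
    arm statistics are reset. `w` and `b` are illustrative placeholders,
    not values prescribed by the paper.
    """

    def __init__(self, n_arms, w=100, b=0.1):
        self.n_arms = n_arms
        self.w = w  # detection window length (placeholder)
        self.b = b  # detection threshold (placeholder)
        self.reset()

    def reset(self):
        # Forget everything and restart learning on the new segment.
        self.counts = [0] * self.n_arms
        self.means = [0.0] * self.n_arms
        self.windows = [deque(maxlen=self.w) for _ in range(self.n_arms)]
        self.t = 0

    def select_arm(self):
        self.t += 1
        # Play each arm once before using the UCB index.
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a
        # Standard UCB1 index on the statistics of the current segment.
        return max(
            range(self.n_arms),
            key=lambda a: self.means[a]
            + math.sqrt(2.0 * math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental mean update for the pulled arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
        self.windows[arm].append(reward)
        # Change detection: compare the means of the two halves of the window.
        win = self.windows[arm]
        if len(win) == self.w:
            half = self.w // 2
            first = sum(list(win)[:half]) / half
            second = sum(list(win)[half:]) / (self.w - half)
            if abs(second - first) > self.b:
                self.reset()  # detected mean shift: restart all arms
```

The restart-on-detection design is what lets the regret scale with the number of stationary segments $M$ rather than with the horizon alone: within each detected segment the agent behaves like plain UCB1, and the detector bounds how long it keeps exploiting stale estimates after a change.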
