MNL-Bandit in non-stationary environments
Abstract
In this paper, we study the MNL-Bandit problem in a non-stationary environment and present an algorithm with worst-case dynamic regret of . Here is the number of arms, is the number of switches and is a variation measure of the unknown parameters. We also show that our algorithm is near-optimal (up to logarithmic factors). Our algorithm builds upon the epoch-based algorithm for stationary MNL-Bandit in Agrawal et al. 2016. However, non-stationarity poses several challenges and we introduce new techniques and ideas to address these. In particular, we give a tight characterization for the bias introduced in the estimators due to non stationarity and derive new concentration bounds.
View on arXivComments on this paper
