
Non-Stationary Bandits with Habituation and Recovery Dynamics

Abstract

Many settings require a decision maker to repeatedly choose from a set of interventions to apply to an individual without knowing the interventions' efficacy a priori. However, repeated application of a specific intervention may reduce its efficacy, while abstaining from applying an intervention may cause its efficacy to recover. Such phenomena are observed in many real-world settings, such as personalized healthcare-adherence-improving interventions and targeted online advertising. Although finding an optimal intervention policy for models with this structure is PSPACE-complete, we propose and analyze a new class of models called ROGUE (Reducing or Gaining Unknown Efficacy) bandits, which, as we show in this paper, can capture these phenomena and can be solved efficiently. We first present a consistent maximum likelihood approach to estimating the parameters of these models, and we conduct a statistical analysis to construct finite-sample concentration bounds. These statistical bounds are used to derive an upper confidence bound strategy that we call the ROGUE Upper Confidence Bound (ROGUE-UCB) algorithm. Our theoretical analysis shows that the ROGUE-UCB algorithm achieves regret that is logarithmic in time, unlike existing algorithms, which incur linear regret. We conclude with a numerical experiment using real-world data from a personalized healthcare-adherence-improving intervention to increase physical activity. Here, the goal is to optimize the selection of messages (e.g., confidence-increasing vs. knowledge-increasing) to send to each individual each day to increase adherence and physical activity. Our results show that ROGUE-UCB performs better in terms of cumulative regret and average reward than state-of-the-art algorithms, and the use of ROGUE-UCB increases daily step counts by roughly 1,000 steps a day (about an extra half mile of walking) compared to other algorithms.
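For context on how an upper confidence bound strategy of this kind operates, below is a minimal sketch of a generic UCB1-style selection loop in Python. This is not the paper's ROGUE-UCB algorithm: ROGUE-UCB replaces the standard stationary confidence radius with bounds derived from maximum likelihood estimates of the ROGUE model's habituation/recovery parameters, which are not reproduced here. The function names, Bernoulli reward simulator, and parameters below are illustrative assumptions only.

```python
import math
import random


def ucb1_select(counts, means, t):
    """Pick the arm maximizing mean + sqrt(2 ln t / n), the standard UCB1 index.

    ROGUE-UCB (per the abstract) would instead use a confidence radius built
    from finite-sample concentration bounds on the ROGUE model's MLE parameters.
    """
    for arm, n in enumerate(counts):
        if n == 0:  # play each arm once before the index is well defined
            return arm
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def run(true_probs, horizon=1000, seed=0):
    """Simulate UCB1 on stationary Bernoulli arms (illustrative only; the
    paper's setting has rewards whose efficacy habituates and recovers)."""
    rng = random.Random(seed)
    k = len(true_probs)
    counts, means = [0] * k, [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        arm = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
        total += reward
    return total / horizon


if __name__ == "__main__":
    # Average reward should approach the best arm's mean (0.7 here) over time.
    print(run([0.2, 0.5, 0.7]))
```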
