
The Intrinsic Robustness of Stochastic Bandits to Strategic Manipulation

International Conference on Machine Learning (ICML), 2019
Abstract

Motivated by economic applications such as recommender systems, we study the behavior of stochastic bandit algorithms under \emph{strategic behavior} conducted by rational actors, i.e., the arms. Each arm is a \emph{self-interested} strategic player who can modify its own reward whenever pulled, subject to a cross-period budget constraint, in order to maximize its own expected number of pulls. We analyze the robustness of three popular bandit algorithms: UCB, $\varepsilon$-Greedy, and Thompson Sampling. We prove that all three algorithms achieve a regret upper bound $\mathcal{O}(\max\{B, K\ln T\})$, where $B$ is the total budget across arms, $K$ is the total number of arms, and $T$ is the length of the time horizon. This regret guarantee holds under \emph{arbitrary adaptive} manipulation strategies of the arms. Our second set of main results shows that this regret bound is \emph{tight} -- in fact, for UCB it is tight even when we restrict the arms' manipulation strategies to form a \emph{Nash equilibrium}. The lower bound makes use of a simple manipulation strategy, the same for all three algorithms, yielding a bound of $\Omega(\max\{B, K\ln T\})$. Our results illustrate the robustness of classic bandit algorithms against strategic manipulations as long as $B = o(T)$.
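
To make the setting concrete, the following is a minimal simulation sketch (not the paper's code) of UCB1 facing one strategic arm that inflates its observed reward, subject to a total manipulation budget B. The Bernoulli reward model, the choice of which arm manipulates, and the "spend budget on every pull" strategy are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def ucb_with_manipulation(T=10000, K=5, B=200, seed=0):
        """Simulate UCB1 when the worst arm (index K-1) inflates each observed
        reward up to 1, until its total inflation reaches the budget B.
        Illustrative sketch only; assumptions noted in the lead-in text."""
        rng = np.random.default_rng(seed)
        true_means = np.linspace(0.9, 0.5, K)   # arm K-1 has the lowest true mean
        counts = np.zeros(K)
        sums = np.zeros(K)
        budget_left = float(B)

        for t in range(1, T + 1):
            if t <= K:
                arm = t - 1                      # pull each arm once to initialize
            else:
                ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
                arm = int(np.argmax(ucb))
            reward = float(rng.random() < true_means[arm])   # Bernoulli reward
            if arm == K - 1 and budget_left > 0:
                bonus = min(1.0 - reward, budget_left)       # inflate, capped at 1 and by budget
                reward += bonus
                budget_left -= bonus
            counts[arm] += 1
            sums[arm] += reward
        return counts

    if __name__ == "__main__":
        print("pulls per arm:", ucb_with_manipulation())

With B = o(T), the manipulating arm can only distort a vanishing fraction of rounds, which is the regime in which the abstract's O(max{B, K ln T}) regret guarantee applies.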
