
BanditQ: Fair Multi-Armed Bandits with Guaranteed Rewards per Arm

Conference on Uncertainty in Artificial Intelligence (UAI), 2023
Abstract

Classic no-regret online prediction algorithms, including variants of the Upper Confidence Bound ($\texttt{UCB}$) algorithm, $\texttt{Hedge}$, and $\texttt{EXP3}$, are inherently unfair by design. The unfairness stems from their very objective of playing the most rewarding arm as many times as possible while ignoring the less rewarding ones among $N$ arms. In this paper, we consider a fair prediction problem in the stochastic setting with hard lower bounds on the rate of accrual of rewards for a set of arms. We study the problem in both full-information and bandit feedback settings. Using queueing-theoretic techniques in conjunction with adversarial learning, we propose a new online prediction policy called $\texttt{BanditQ}$ that achieves the target reward rates while incurring a regret and target-rate violation penalty of $O(T^{3/4})$. In the full-information setting, the regret bound can be further improved to $O(\sqrt{T})$ when considering the average regret over the entire horizon of length $T$. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard MAB problem with a carefully defined sequence of rewards. The design and analysis of the $\texttt{BanditQ}$ policy involve a novel use of the potential function method in conjunction with scale-free second-order regret bounds and a new self-bounding inequality for the reward gradients, which are of independent interest.
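The abstract describes the reduction only at a high level. Below is a minimal, illustrative sketch of the general idea it points to: virtual queues track each arm's target-rate deficit, and an adversarial bandit learner is run on queue-weighted surrogate rewards, so arms falling behind their targets look more attractive. This is not the authors' exact algorithm; the mean rewards `mu`, targets `lam`, the trade-off parameter `V`, the learning rate `eta`, the exploration rate `gamma`, and the specific surrogate `(V + Q[arm]) * r` are all assumptions for illustration, and plain $\texttt{EXP3}$ stands in for the scale-free subroutine the paper actually analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)

N, T = 3, 50_000
mu = np.array([0.9, 0.5, 0.4])    # hypothetical Bernoulli mean rewards (stochastic setting)
lam = np.array([0.0, 0.2, 0.1])   # hypothetical target reward rates per arm

# Virtual queues: Q[i] accumulates arm i's deficit against its target rate.
Q = np.zeros(N)

# EXP3 state, run on the queue-weighted surrogate rewards.
eta = np.sqrt(np.log(N) / (T * N))  # standard EXP3-style learning rate
gamma = 0.05                        # uniform exploration mixing
V = np.sqrt(T)                      # assumed reward-vs-fairness trade-off parameter

logw = np.zeros(N)
total_reward = np.zeros(N)

for t in range(T):
    w = np.exp(logw - logw.max())
    p = (1 - gamma) * w / w.sum() + gamma / N
    arm = rng.choice(N, p=p)
    r = float(rng.random() < mu[arm])  # bandit feedback: only the pulled arm's reward

    # Surrogate reward: queue-length-weighted, so deficit arms gain priority.
    surrogate = (V + Q[arm]) * r
    # Importance-weighted EXP3 update on the chosen arm, rescaled by V so the
    # update stays on a moderate scale when queues are short.
    logw[arm] += eta * surrogate / (p[arm] * V)

    # Queue dynamics: targets "arrive" every round; earned reward is "service".
    served = np.zeros(N)
    served[arm] = r
    Q = np.maximum(Q + lam - served, 0.0)

    total_reward[arm] += r

print("empirical reward rates:", total_reward / T)
print("target rates:         ", lam)
```

The design choice the sketch illustrates is the Lyapunov-style one suggested by the abstract: keeping the queues stable forces the long-run reward rate of each constrained arm to meet its target, while the adversarial learner keeps the residual regret small on the surrogate sequence.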
