
BanditQ: Fair Multi-Armed Bandits with Guaranteed Rewards per Arm

Conference on Uncertainty in Artificial Intelligence (UAI), 2023
Abstract

Classic no-regret online prediction algorithms, including variants of the Upper Confidence Bound ($\texttt{UCB}$) algorithm, $\texttt{Hedge}$, and $\texttt{EXP3}$, are inherently unfair by design. The unfairness stems from their very objective of playing the most rewarding arm as many times as possible while ignoring the less rewarding ones among $N$ arms. In this paper, we consider a fair prediction problem in the stochastic setting with hard lower bounds on the rate of accrual of rewards for a set of arms. We study the problem in both full-information and bandit feedback settings. Using queueing-theoretic techniques in conjunction with adversarial learning, we propose a new online prediction policy called $\texttt{BanditQ}$ that achieves the target reward rates while incurring a regret and target-rate violation penalty of $O(T^{3/4})$. In the full-information setting, the regret bound can be further improved to $O(\sqrt{T})$ when considering the average regret over the entire horizon of length $T$. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard MAB problem with a carefully defined sequence of rewards. The design and analysis of the $\texttt{BanditQ}$ policy involve a novel use of the potential function method in conjunction with scale-free second-order regret bounds and a new self-bounding inequality for the reward gradients, which are of independent interest.
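The abstract describes the reduction only at a high level. Below is a minimal, illustrative sketch of the general idea it points to: virtual queues track each arm's target-rate deficit, and an adversarial bandit learner is run on queue-weighted surrogate rewards, so arms falling behind their targets look more attractive. This is not the authors' exact algorithm; the mean rewards `mu`, targets `lam`, the trade-off parameter `V`, the learning rate `eta`, the exploration rate `gamma`, and the specific surrogate `(V + Q[arm]) * r` are all assumptions for illustration, and plain $\texttt{EXP3}$ stands in for the scale-free subroutine the paper actually analyzes.

```python
import numpy as np

rng = np.random.default_rng(0)

N, T = 3, 50_000
mu = np.array([0.9, 0.5, 0.4])    # hypothetical Bernoulli mean rewards (stochastic setting)
lam = np.array([0.0, 0.2, 0.1])   # hypothetical target reward rates per arm

# Virtual queues: Q[i] accumulates arm i's deficit against its target rate.
Q = np.zeros(N)

# EXP3 state, run on the queue-weighted surrogate rewards.
eta = np.sqrt(np.log(N) / (T * N))  # standard EXP3-style learning rate
gamma = 0.05                        # uniform exploration mixing
V = np.sqrt(T)                      # assumed reward-vs-fairness trade-off parameter

logw = np.zeros(N)
total_reward = np.zeros(N)

for t in range(T):
    w = np.exp(logw - logw.max())
    p = (1 - gamma) * w / w.sum() + gamma / N
    arm = rng.choice(N, p=p)
    r = float(rng.random() < mu[arm])  # bandit feedback: only the pulled arm's reward

    # Surrogate reward: queue-length-weighted, so deficit arms gain priority.
    surrogate = (V + Q[arm]) * r
    # Importance-weighted EXP3 update on the chosen arm, rescaled by V so the
    # update stays on a moderate scale when queues are short.
    logw[arm] += eta * surrogate / (p[arm] * V)

    # Queue dynamics: targets "arrive" every round; earned reward is "service".
    served = np.zeros(N)
    served[arm] = r
    Q = np.maximum(Q + lam - served, 0.0)

    total_reward[arm] += r

print("empirical reward rates:", total_reward / T)
print("target rates:         ", lam)
```

The design choice the sketch illustrates is the Lyapunov-style one suggested by the abstract: keeping the queues stable forces the long-run reward rate of each constrained arm to meet its target, while the adversarial learner keeps the residual regret small on the surrogate sequence.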
