
p-Mean Regret for Stochastic Bandits

Abstract

In this work, we extend the concept of the $p$-mean welfare objective from social choice theory (Moulin 2004) to study $p$-mean regret in stochastic multi-armed bandit problems. The $p$-mean regret, defined as the difference between the optimal mean among the arms and the $p$-mean of the expected rewards, offers a flexible framework for evaluating bandit algorithms, enabling algorithm designers to balance fairness and efficiency by adjusting the parameter $p$. Our framework encompasses both average cumulative regret and Nash regret as special cases. We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that achieves novel $p$-mean regret bounds. Our algorithm consists of two phases: a carefully calibrated uniform exploration phase to initialize sample means, followed by the UCB1 algorithm of Auer, Cesa-Bianchi, and Fischer (2002). Under mild assumptions, we prove that our algorithm achieves a $p$-mean regret bound of $\tilde{O}\left(\sqrt{\frac{k}{T^{1/(2|p|)}}}\right)$ for all $p \leq -1$, where $k$ represents the number of arms and $T$ the time horizon. When $-1 < p < 0$, we achieve a regret bound of $\tilde{O}\left(\sqrt{\frac{k^{1.5}}{T^{1/2}}}\right)$. For the range $0 < p \leq 1$, we achieve a $p$-mean regret scaling as $\tilde{O}\left(\sqrt{\frac{k}{T}}\right)$, which matches the previously established lower bound up to logarithmic factors (Auer et al. 1995). This result stems from the fact that the $p$-mean regret of any algorithm is at least its average cumulative regret for $p \leq 1$. In the case of Nash regret (the limit as $p$ approaches zero), our unified approach differs from prior work (Barman et al. 2023), which requires a new Nash Confidence Bound algorithm. Notably, we achieve the same regret bound up to constant factors using our more general method.
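To make the objective and the two-phase scheme concrete, the following is a minimal Python sketch, not code from the paper: `p_mean_regret` computes the $p$-mean regret from the (assumed positive) expected rewards of the arms pulled over the horizon, and `explore_then_ucb` illustrates the uniform-exploration-then-UCB1 structure. The function names and the parameter `explore_rounds_per_arm` are illustrative assumptions; the paper calibrates the length of the exploration phase as a function of $k$ and $T$.

```python
import numpy as np

def p_mean(values, p):
    """Generalized p-mean of positive values.

    For p != 0 this is (mean(values**p))**(1/p); the limit p -> 0
    is the geometric mean, which underlies Nash regret.
    """
    values = np.asarray(values, dtype=float)
    if p == 0:
        return np.exp(np.mean(np.log(values)))
    return np.mean(values ** p) ** (1.0 / p)

def p_mean_regret(mu_star, pulled_means, p):
    """p-mean regret: the optimal arm mean minus the p-mean of the
    expected rewards collected over the horizon (one entry per round)."""
    return mu_star - p_mean(pulled_means, p)

def explore_then_ucb(arms, T, explore_rounds_per_arm):
    """Two-phase sketch: uniform exploration to initialize the sample
    means, then standard UCB1 index play for the remaining rounds.

    `arms` is a list of callables, each returning a stochastic reward
    in [0, 1]; we assume k * explore_rounds_per_arm <= T.
    """
    k = len(arms)
    counts = np.zeros(k)
    sums = np.zeros(k)
    rewards = []

    def pull(i):
        r = arms[i]()
        counts[i] += 1
        sums[i] += r
        rewards.append(r)

    # Phase 1: uniform exploration over all arms.
    for i in range(k):
        for _ in range(explore_rounds_per_arm):
            pull(i)

    # Phase 2: UCB1 (Auer, Cesa-Bianchi, Fischer 2002) for the rest.
    for t in range(len(rewards), T):
        ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
        pull(int(np.argmax(ucb)))

    return np.array(rewards)
```

As a usage example, with Bernoulli arms `arms = [lambda q=q: float(np.random.rand() < q) for q in (0.3, 0.5, 0.9)]`, calling `explore_then_ucb(arms, T=10_000, explore_rounds_per_arm=50)` runs the sketch; passing the per-round expected rewards of the pulled arms to `p_mean_regret` with, say, $p = -1$ or $p \to 0$ recovers the harmonic-mean and Nash-regret flavors of the objective.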
