p-Mean Regret for Stochastic Bandits

In this work, we extend the concept of the $p$-mean welfare objective from social choice theory (Moulin 2004) to study $p$-mean regret in stochastic multi-armed bandit problems. The $p$-mean regret, defined as the difference between the optimal mean among the arms and the $p$-mean of the expected rewards, offers a flexible framework for evaluating bandit algorithms, enabling algorithm designers to balance fairness and efficiency by adjusting the parameter $p$. Our framework encompasses both average cumulative regret and Nash regret as special cases. We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that achieves novel $p$-mean regret bounds. Our algorithm consists of two phases: a carefully calibrated uniform exploration phase to initialize sample means, followed by the UCB1 algorithm of Auer, Cesa-Bianchi, and Fischer (2002). Under mild assumptions, we prove that our algorithm achieves a $p$-mean regret bound of $\tilde{O}\!\left(\sqrt{k/T^{1/(2|p|)}}\right)$ for all $p \le -1$, where $k$ denotes the number of arms and $T$ the time horizon. When $-1 < p < 0$, we achieve a regret bound of $\tilde{O}\!\left(\sqrt{k^{1.5}/T^{1/2}}\right)$. For the range $0 < p \le 1$, we achieve a $p$-mean regret scaling as $\tilde{O}\!\left(\sqrt{k/T}\right)$, which matches the previously established lower bound up to logarithmic factors (Auer et al. 1995); this follows because the $p$-mean regret of any algorithm is at least its average cumulative regret for $p \le 1$. In the case of Nash regret (the limit as $p$ approaches zero), our unified approach differs from prior work (Barman et al. 2023), which relies on a specialized Nash Confidence Bound algorithm. Notably, we achieve the same regret bound up to constant factors using our more general method.
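As a concrete illustration of the objective and of the two-phase structure described above, here is a minimal Python sketch. The helper names (`p_mean`, `p_mean_regret`, `explore_then_ucb`), the exploration length `T0`, and the Bernoulli example instance are illustrative assumptions rather than the paper's calibrated choices, and the regret is computed from a single realized run rather than in expectation.

```python
import numpy as np

def p_mean(values, p):
    """Generalized (power) mean M_p of positive values.

    p = 1 gives the arithmetic mean; the limit p -> 0 gives the
    geometric mean, the quantity underlying Nash regret.
    """
    values = np.asarray(values, dtype=float)
    if p == 0:
        return np.exp(np.mean(np.log(values)))
    return np.mean(values ** p) ** (1.0 / p)

def p_mean_regret(mu, pulls, p):
    """p-mean regret of a run: gap between the best arm's mean and the
    p-mean of the expected rewards of the arms actually pulled."""
    mu = np.asarray(mu, dtype=float)
    expected_rewards = mu[np.asarray(pulls)]  # expected reward at each round
    return mu.max() - p_mean(expected_rewards, p)

def explore_then_ucb(bandit, k, T, T0):
    """Two-phase sketch: uniform (round-robin) exploration for the first
    T0 rounds, then standard UCB1. `bandit(i)` is assumed to return a
    reward in [0, 1] for arm i; T0 here is illustrative, not the paper's
    calibrated exploration length."""
    counts = np.zeros(k)
    sums = np.zeros(k)
    pulls = []
    for t in range(T):
        if t < T0:
            arm = t % k                        # uniform exploration phase
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))          # UCB1 phase
        r = bandit(arm)
        counts[arm] += 1
        sums[arm] += r
        pulls.append(arm)
    return pulls

# Example: Bernoulli arms; compare p = 1 (average regret) with p -> 0 (Nash).
rng = np.random.default_rng(0)
mu = np.array([0.9, 0.8, 0.5])
pulls = explore_then_ucb(lambda i: rng.binomial(1, mu[i]), k=3, T=5000, T0=300)
print(p_mean_regret(mu, pulls, p=1), p_mean_regret(mu, pulls, p=0))
```

In this sketch, setting `p=1` recovers per-round average regret, while `p=0` evaluates the geometric mean of the collected expected rewards, so the two printed numbers illustrate how the single parameter trades off efficiency against fairness across rounds.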
View on arXiv