
Bandit Allocational Instability

Yilun Chen
Jiaqi Lu
Main: 31 pages, 1 figure; Bibliography: 1 page; Appendix: 1 page
Abstract

When multi-armed bandit (MAB) algorithms allocate pulls among competing arms, the resulting allocation can exhibit substantial variation. This is particularly harmful in modern applications such as learning-enhanced platform operations and post-bandit statistical inference. Thus motivated, we introduce a new performance metric for MAB algorithms termed allocation variability, which is the largest (over arms) standard deviation of an arm's number of pulls. We establish a fundamental trade-off between allocation variability and regret, the canonical performance metric of reward maximization. In particular, for any algorithm, the worst-case regret $R_T$ and worst-case allocation variability $S_T$ must satisfy $R_T \cdot S_T = \Omega(T^{3/2})$ as $T \rightarrow \infty$, as long as $R_T = o(T)$. This indicates that any minimax regret-optimal algorithm must incur worst-case allocation variability $\Theta(T)$, the largest possible scale, while any algorithm with sublinear worst-case regret must necessarily incur $S_T = \omega(\sqrt{T})$. We further show that this lower bound is essentially tight, and that any point on the Pareto frontier $R_T \cdot S_T = \tilde{\Theta}(T^{3/2})$ can be achieved by a simple tunable algorithm, UCB-f, a generalization of the classic UCB1. Finally, we discuss implications for platform operations and for statistical inference when bandit algorithms are used. As a byproduct of our result, we resolve an open question of Praharaj and Khamaru (2025).
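The abstract describes UCB-f only as a tunable generalization of UCB1, without giving its index. Below is a minimal, hypothetical sketch of what such a family could look like: a UCB-style rule whose exploration bonus is driven by a user-supplied function `f`, recovering classic UCB1 when `f(t) = 2 log t`. The function names, signatures, and the specific bonus form are illustrative assumptions, not the paper's actual definition.

```python
import math
import random

def ucb_f(reward_fns, T, f=lambda t: 2.0 * math.log(t)):
    """Hypothetical UCB-f sketch: UCB1 with a tunable exploration
    function f. With f(t) = 2*log(t) this is classic UCB1; other
    choices of f would trade off regret against allocation variability
    (the paper's actual tuning is not specified in the abstract)."""
    K = len(reward_fns)
    counts = [0] * K    # number of pulls per arm
    means = [0.0] * K   # empirical mean reward per arm
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # initialization: pull each arm once
        else:
            # index = empirical mean + exploration bonus driven by f
            arm = max(range(K),
                      key=lambda a: means[a] + math.sqrt(f(t) / counts[a]))
        r = reward_fns[arm]()
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running mean update
    return counts, means

# Example: two Bernoulli arms with means 0.9 and 0.1.
random.seed(0)
counts, means = ucb_f(
    [lambda: 1.0 if random.random() < 0.9 else 0.0,
     lambda: 1.0 if random.random() < 0.1 else 0.0],
    T=2000)
```

The returned `counts` vector is exactly the allocation whose per-arm standard deviation the paper's metric $S_T$ measures; a more exploratory `f` spreads pulls more evenly (lower variability, higher regret), while a greedier `f` concentrates them.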
