Stochastic Multi-armed Bandits in Constant Space

Abstract

We consider the stochastic bandit problem in the sublinear space setting, where one cannot record the win-loss record for all $K$ arms. We give an algorithm using $O(1)$ words of space with regret \[ \sum_{i=1}^{K}\frac{1}{\Delta_i}\log \frac{\Delta_i}{\Delta}\log T \] where $\Delta_i$ is the gap between the best arm and arm $i$ and $\Delta$ is the gap between the best and the second-best arms. If the rewards are bounded away from $0$ and $1$, this is within an $O(\log 1/\Delta)$ factor of the optimum regret possible without space constraints.
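To get a feel for how the regret expression above scales, the following minimal Python sketch evaluates it for a given set of arm means. It is only an illustration of the formula as written in the abstract, not the paper's algorithm. The function name `constant_space_regret_expression` and its parameters are hypothetical, and the sketch assumes that arms with zero gap (the best arm and any ties) are skipped, since their terms are undefined.

```python
import math


def constant_space_regret_expression(means, horizon):
    """Evaluate sum_i (1/Delta_i) * log(Delta_i/Delta) * log(T).

    Assumption (not stated in the abstract): arms with zero gap,
    i.e. the best arm and any ties, are left out of the sum.
    """
    best = max(means)
    # Delta_i: gap between the best arm and arm i, for suboptimal arms only.
    gaps = [best - m for m in means if best - m > 0]
    if not gaps:
        return 0.0
    # Delta: gap between the best and the second-best arm.
    delta = min(gaps)
    total = sum((1.0 / g) * math.log(g / delta) for g in gaps)
    return total * math.log(horizon)


if __name__ == "__main__":
    # Hypothetical instance: four arms, horizon T = 1000.
    print(constant_space_regret_expression([0.9, 0.8, 0.7, 0.5], 1000))
```

The abstract states the expression without constants, so the printed value is meaningful only for comparing instances, not as an absolute regret figure.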
