
Stochastic Multi-armed Bandits in Constant Space

International Conference on Artificial Intelligence and Statistics (AISTATS), 2017
Abstract

We consider the stochastic bandit problem in the sublinear space setting, where one cannot record the win-loss record for all $K$ arms. We give an algorithm using $O(1)$ words of space with regret \[ \sum_{i=1}^{K}\frac{1}{\Delta_i}\log \frac{\Delta_i}{\Delta}\log T \] where $\Delta_i$ is the gap between the best arm and arm $i$ and $\Delta$ is the gap between the best and the second-best arms. If the rewards are bounded away from $0$ and $1$, this is within an $O(\log 1/\Delta)$ factor of the optimum regret possible without space constraints.
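To make the shape of the bound concrete, the following is a minimal sketch that evaluates the expression from the abstract for a given set of arm means and compares it with the familiar $\sum_i \frac{1}{\Delta_i}\log T$ form of the regret without space constraints. It is not the paper's algorithm. The function names are hypothetical, hidden constants are ignored, natural logarithms are used, and arms with zero gap (the best arm) are left out of the sum as an assumption, since the abstract does not say how that term is handled.

```python
import math


def gaps_from_means(means):
    """Return the gaps Delta_i of all arms strictly worse than the best arm.

    Assumption: arms tied with the best arm (gap 0) are dropped so that
    1 / Delta_i is well defined.
    """
    best = max(means)
    return [best - m for m in means if m < best]


def constant_space_bound(gaps, horizon):
    """Evaluate sum_i (1/Delta_i) * log(Delta_i/Delta) * log T, constants ignored."""
    delta = min(gaps)  # Delta: gap between the best and second-best arms
    return sum(math.log(g / delta) / g for g in gaps) * math.log(horizon)


def classical_bound(gaps, horizon):
    """Evaluate sum_i (1/Delta_i) * log T, the usual unconstrained form."""
    return sum(1.0 / g for g in gaps) * math.log(horizon)


if __name__ == "__main__":
    means = [0.9, 0.8, 0.6, 0.5]  # illustrative values, not from the paper
    horizon = 10**6
    gaps = gaps_from_means(means)
    ratio = constant_space_bound(gaps, horizon) / classical_bound(gaps, horizon)
    print("gaps:", gaps)
    print("ratio of the two expressions:", ratio)
    print("log(1/Delta):", math.log(1.0 / min(gaps)))
```

Because every gap satisfies $\Delta_i \le 1$ when rewards lie in $[0,1]$, each factor $\log(\Delta_i/\Delta)$ is at most $\log(1/\Delta)$, so the ratio printed above cannot exceed $\log(1/\Delta)$. This is the elementary step behind the $O(\log 1/\Delta)$ comparison, stated here without the constants the paper would track.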
