Understanding Memory-Regret Trade-Off for Streaming Stochastic Multi-Armed Bandits

Abstract

We study the stochastic multi-armed bandit problem in the $P$-pass streaming model. In this problem, the $n$ arms are present in a stream and at most $m<n$ arms and their statistics can be stored in the memory. We give a complete characterization of the optimal regret in terms of $m$, $n$ and $P$. Specifically, we design an algorithm with $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ regret and complement it with an $\tilde \Omega\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ lower bound when the number of rounds $T$ is sufficiently large. Our results are tight up to a logarithmic factor in $n$ and $P$.
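
To make the stated bound easier to read, the following minimal sketch evaluates the three exponents of $(n-m)$, $n$ and $T$ for a given number of passes $P$. The function names `regret_exponents` and `regret_scale` are hypothetical and not from the paper, and the sketch ignores the constants and logarithmic factors hidden by $\tilde O$; it only illustrates how the exponents change with $P$.

```python
from fractions import Fraction


def regret_exponents(P):
    """Exponents (a, b, c) in (n-m)^a * n^b * T^c from the stated bound.

    Hypothetical helper: it only transcribes the exponents in the abstract.
    """
    if P < 1:
        raise ValueError("P must be a positive integer")
    denom = 2 ** (P + 1) - 1
    a = 1 + Fraction(2 ** P - 2, denom)      # exponent of (n - m)
    b = Fraction(2 - 2 ** (P + 1), denom)    # exponent of n
    c = Fraction(2 ** P, denom)              # exponent of T
    return a, b, c


def regret_scale(n, m, T, P):
    """Polynomial part of the bound, without constants or log factors."""
    if not 0 <= m < n:
        raise ValueError("memory size m must satisfy 0 <= m < n")
    a, b, c = regret_exponents(P)
    return (n - m) ** float(a) * n ** float(b) * T ** float(c)


if __name__ == "__main__":
    for P in (1, 2, 3, 10):
        a, b, c = regret_exponents(P)
        print(f"P={P}: (n-m)^{a} * n^{b} * T^{c}")
```

For a single pass ($P=1$) the exponents are $1$, $-2/3$ and $2/3$, so the bound reads $(n-m)\,n^{-2/3}\,T^{2/3}$; as $P$ grows, the exponent of $T$ decreases toward $1/2$.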
