
Bandits with Side Observations: Bounded vs. Logarithmic Regret

Abstract

We consider the classical stochastic multi-armed bandit problem, but where, from time to time and roughly with frequency $\epsilon$, an extra observation is gathered by the agent for free. We prove that, no matter how small $\epsilon$ is, the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with regret smaller than $\sum_i \frac{\log(1/\epsilon)}{\Delta_i}$, up to multiplicative constants and $\log\log$ terms. We also prove a matching lower bound, stating that no reasonable algorithm can outperform this quantity.
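For intuition about the setting, here is a minimal simulation sketch, not the paper's algorithm: a standard UCB1 learner that additionally receives, with probability $\epsilon$ per round, a free observation of a uniformly random arm. The arm means, horizon, and the assumption that free observations land on a uniform arm are all illustrative choices, not taken from the paper.

```python
import math
import random

def simulate(eps=0.05, horizon=10_000, means=(0.9, 0.5, 0.4), seed=0):
    """Bernoulli bandit with free side observations (illustrative only).

    Each round the learner pulls one arm via the UCB1 rule and pays
    regret for it; independently, with probability `eps`, a uniformly
    random arm is observed for free and its statistics are updated
    at no regret cost.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # samples per arm (paid + free)
    sums = [0.0] * k          # sum of observed rewards per arm
    best = max(means)
    regret = 0.0

    def observe(arm):
        counts[arm] += 1
        sums[arm] += 1.0 if rng.random() < means[arm] else 0.0

    for t in range(1, horizon + 1):
        if 0 in counts:       # pull each arm once before using indices
            arm = counts.index(0)
        else:                 # UCB1 index: empirical mean + bonus
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        regret += best - means[arm]   # paid pull incurs regret
        observe(arm)

        if rng.random() < eps:        # free side observation, no regret
            observe(rng.randrange(k))

    return regret

if __name__ == "__main__":
    for eps in (0.0, 0.01, 0.1):
        print(f"eps={eps}: cumulative regret ~ {simulate(eps=eps):.1f}")
```

Running the sketch with a few values of $\epsilon$ shows regret growth flattening as the free-observation frequency increases, consistent with the bounded-regret phenomenon the abstract describes; the paper's actual algorithm and constants are not reproduced here.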
