Bandits with Side Observations: Bounded vs. Logarithmic Regret
Abstract
We consider the classical stochastic multi-armed bandit, but where, from time to time and roughly with frequency $\epsilon$, an extra observation is gathered by the agent for free. We prove that, no matter how small $\epsilon$ is, the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with a regret smaller than $\sum_i \frac{\log(1/\epsilon)}{\Delta_i}$, up to a multiplicative constant and loglog terms. We also prove a matching lower bound, stating that no reasonable algorithm can outperform this quantity.
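
To make the setting concrete, the following is a minimal simulation sketch, not the algorithm constructed in the paper. It assumes Bernoulli rewards, a standard UCB1 index as a stand-in policy, and a free observation that arrives independently in each round with probability `eps` from a uniformly random arm; the abstract does not specify how the observed arm is chosen, so that choice is an assumption. The names `simulate`, `means`, `eps`, and `horizon` are hypothetical.

```python
import math
import random


def simulate(means, eps, horizon, seed=0):
    """Pseudo-regret of UCB1 when free side observations arrive w.p. eps.

    Assumptions (not from the paper): Bernoulli arms, UCB1 as the policy,
    and a uniformly random arm for each free observation.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of samples seen per arm (pulls + free ones)
    sums = [0.0] * k   # sum of observed rewards per arm
    best = max(means)
    regret = 0.0

    def observe(arm):
        # Draw one Bernoulli sample from the arm and record it.
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    for t in range(1, horizon + 1):
        if 0 in counts:
            # Make sure every arm has at least one sample.
            arm = counts.index(0)
        else:
            # UCB1 index: empirical mean plus exploration bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        observe(arm)
        regret += best - means[arm]  # only actual pulls incur regret

        # Free side observation: updates statistics but costs nothing.
        if rng.random() < eps:
            observe(rng.randrange(k))

    return regret
```

Calling `simulate([0.9, 0.1], eps=0.1, horizon=1000)` returns the cumulative pseudo-regret of one run. Free samples feed the same statistics as paid pulls, which is what lets them stand in for costly exploration.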
