Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits
A stochastic combinatorial semi-bandit with a linear payoff is a sequential learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, then observes noisy weights of all chosen items, and finally receives their sum as a payoff. In this work, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we show that a relatively simple learning algorithm, which is known to be computationally efficient, also achieves near-optimal regret. We refer to this method as CombUCB1, and show that its n-step regret is O(K L (1/Δ) log n) and O(√(K L n log n)), where L is the number of ground items, K is the maximum number of chosen items, and Δ is the gap between the expected weights of the best and second best solutions. The O(K L (1/Δ) log n) upper bound is tight up to a constant factor and the O(√(K L n log n)) upper bound is tight up to a factor of √(log n).
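The algorithm analyzed here follows the usual UCB pattern: maintain a UCB1-style optimistic index for each ground item and let an offline oracle pick the feasible set maximizing the sum of indices. Below is a minimal sketch under illustrative assumptions not taken from the abstract: Bernoulli item weights, the top-K constraint (choose any K of the L items, so the oracle is just a sort), and a 1.5·log t confidence-radius constant; the function name and all parameters are hypothetical.

```python
import math
import random

def comb_ucb1(means, K, n, seed=0):
    """Sketch of a CombUCB1-style semi-bandit learner on the top-K
    problem: each step choose K of the L ground items, observe each
    chosen item's Bernoulli weight, and receive their sum as payoff.
    Returns the cumulative expected regret against the best set."""
    rng = random.Random(seed)
    L = len(means)
    counts = [0] * L       # T_i: number of times item i was observed
    estimates = [0.0] * L  # empirical mean weight of item i
    best = sum(sorted(means, reverse=True)[:K])  # optimal expected payoff
    regret = 0.0
    for t in range(1, n + 1):
        if min(counts) == 0:
            # initialization: observe the least-observed items until
            # every ground item has been seen at least once
            chosen = sorted(range(L), key=lambda i: counts[i])[:K]
        else:
            # optimistic index per item; for the top-K constraint the
            # oracle maximizing the sum of indices is just "take the
            # K largest"
            ucb = [estimates[i] + math.sqrt(1.5 * math.log(t) / counts[i])
                   for i in range(L)]
            chosen = sorted(range(L), key=lambda i: -ucb[i])[:K]
        for i in chosen:  # semi-bandit feedback: every chosen item's weight
            x = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            estimates[i] += (x - estimates[i]) / counts[i]
        regret += best - sum(means[i] for i in chosen)
    return regret
```

With a gap Δ between the best and second-best sets, the per-item exploration cost scales like (log n)/Δ, so the cumulative regret of a run such as `comb_ucb1([0.9, 0.8, 0.2, 0.1], K=2, n=2000)` stays far below linear growth.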