Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits
A stochastic combinatorial semi-bandit with a linear payoff is a sequential learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, then observes noisy weights of all chosen items, and finally receives their sum as a payoff. In this work, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we show that a relatively simple learning algorithm, which is known to be computationally efficient, also achieves near-optimal regret. We refer to this method as CombUCB1, and show that its n-step regret is O(K L (1/Δ) log n) and O(√(K L n log n)), where L is the number of ground items, K is the maximum number of chosen items, and Δ is the gap between the expected weights of the best and second best solutions. The O(K L (1/Δ) log n) upper bound is tight up to a constant factor and the O(√(K L n log n)) upper bound is tight up to a factor of √(log n).
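The algorithm analyzed here follows the usual UCB pattern: maintain a UCB1-style optimistic index for each ground item and let an offline oracle pick the feasible set maximizing the sum of indices. Below is a minimal sketch under illustrative assumptions not taken from the abstract: Bernoulli item weights, the top-K constraint (choose any K of the L items, so the oracle is just a sort), and a 1.5·log t confidence-radius constant; the function name and all parameters are hypothetical.

```python
import math
import random

def comb_ucb1(means, K, n, seed=0):
    """Sketch of a CombUCB1-style semi-bandit learner on the top-K
    problem: each step choose K of the L ground items, observe each
    chosen item's Bernoulli weight, and receive their sum as payoff.
    Returns the cumulative expected regret against the best set."""
    rng = random.Random(seed)
    L = len(means)
    counts = [0] * L       # T_i: number of times item i was observed
    estimates = [0.0] * L  # empirical mean weight of item i
    best = sum(sorted(means, reverse=True)[:K])  # optimal expected payoff
    regret = 0.0
    for t in range(1, n + 1):
        if min(counts) == 0:
            # initialization: observe the least-observed items until
            # every ground item has been seen at least once
            chosen = sorted(range(L), key=lambda i: counts[i])[:K]
        else:
            # optimistic index per item; for the top-K constraint the
            # oracle maximizing the sum of indices is just "take the
            # K largest"
            ucb = [estimates[i] + math.sqrt(1.5 * math.log(t) / counts[i])
                   for i in range(L)]
            chosen = sorted(range(L), key=lambda i: -ucb[i])[:K]
        for i in chosen:  # semi-bandit feedback: every chosen item's weight
            x = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            estimates[i] += (x - estimates[i]) / counts[i]
        regret += best - sum(means[i] for i in chosen)
    return regret
```

With a gap Δ between the best and second-best sets, the per-item exploration cost scales like (log n)/Δ, so the cumulative regret of a run such as `comb_ucb1([0.9, 0.8, 0.2, 0.1], K=2, n=2000)` stays far below linear growth.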