
Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits

International Conference on Artificial Intelligence and Statistics (AISTATS), 2014
Abstract

A stochastic combinatorial semi-bandit with a linear payoff is a sequential learning problem where at each step a learning agent chooses a subset of ground items subject to some combinatorial constraints, then observes noisy weights of all chosen items, and finally receives their sum as a payoff. In this work, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we show that a relatively simple learning algorithm, which is known to be computationally efficient, also achieves near-optimal regret. We refer to this method as CombUCB1, and show that its $n$-step regret is $O(K L (1 / \Delta) \log n)$ and $O(\sqrt{K L n \log n})$, where $L$ is the number of ground items, $K$ is the maximum number of chosen items, and $\Delta$ is the gap between the expected weights of the best and second best solutions. The $O(K L (1 / \Delta) \log n)$ upper bound is tight up to a constant and the $O(\sqrt{K L n \log n})$ upper bound is tight up to a factor of $\sqrt{\log n}$.
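The semi-bandit protocol described above can be sketched in a few lines of code: maintain a UCB-style optimistic index for each ground item, pick the feasible subset that maximizes the sum of indices, and update the per-item statistics from the observed weights. The sketch below is a minimal illustration under simplifying assumptions, not the paper's reference implementation: the combinatorial constraint is the simplest one (choose any $K$ of $L$ items, so the oracle is top-$K$ selection), the noise model is Bernoulli, and the index constant and all names are illustrative.

```python
import math
import random


def comb_ucb1(n_steps, L, K, means, rng):
    """Hedged sketch of a CombUCB1-style learner on the 'any K of L items'
    constraint. `means` are the expected item weights, unknown to the learner
    and used here only to simulate feedback; the Bernoulli noise and the
    index constant 1.5 are illustrative assumptions."""
    counts = [0] * L      # number of times each item's weight was observed
    sums = [0.0] * L      # running sum of observed weights per item
    total_reward = 0.0

    for t in range(1, n_steps + 1):
        # Optimistic index per item; never-observed items get an infinite index,
        # which forces each item to be tried at least once.
        ucb = []
        for i in range(L):
            if counts[i] == 0:
                ucb.append(float("inf"))
            else:
                mean = sums[i] / counts[i]
                ucb.append(mean + math.sqrt(1.5 * math.log(t) / counts[i]))

        # "Oracle" step: maximize the sum of indices subject to the constraint.
        # For the top-K constraint this is just sorting; for general constraints
        # it would be a call to an offline combinatorial solver.
        chosen = sorted(range(L), key=lambda i: ucb[i], reverse=True)[:K]

        # Semi-bandit feedback: a noisy weight is observed for *every* chosen
        # item, and the payoff is their sum.
        for i in chosen:
            w = 1.0 if rng.random() < means[i] else 0.0  # Bernoulli weight
            counts[i] += 1
            sums[i] += w
            total_reward += w

    return total_reward
```

For instance, with $L = 5$, $K = 2$, and item means `[0.9, 0.8, 0.1, 0.1, 0.1]`, the learner quickly concentrates its choices on the two high-mean items, so the average per-step payoff approaches the optimum of 1.7. The key design point, which makes the algorithm computationally efficient, is that learning only touches per-item statistics; all combinatorial structure is delegated to the oracle call.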
