Minimax Regret for Cascading Bandits

Abstract

Cascading bandits model the task of learning to rank $K$ out of $L$ items over $n$ rounds of partial feedback. For this model, the minimax (i.e., gap-free) regret is poorly understood; in particular, the best known lower and upper bounds are $\Omega(\sqrt{nL/K})$ and $\tilde{O}(\sqrt{nLK})$, respectively. We improve the lower bound to $\Omega(\sqrt{nL})$ and show that CascadeKL-UCB (which ranks items by their KL-UCB indices) attains it up to log terms. Surprisingly, we also show that CascadeUCB1 (which ranks via UCB1) can suffer suboptimal $\Omega(\sqrt{nLK})$ regret. This sharply contrasts with standard $L$-armed bandits, where the corresponding algorithms both achieve the minimax regret $\sqrt{nL}$ (up to log terms), and the main advantage of KL-UCB is only to improve constants in the gap-dependent bounds. In essence, this contrast occurs because Pinsker's inequality is tight for hard problems in the $L$-armed case but loose (by a factor of $K$) in the cascading case.
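To make the two index rules concrete, below is a minimal Python sketch (not from the paper) of the UCB1 and KL-UCB indices under Bernoulli click feedback, together with the standard cascade update in which the user scans the ranked list top-down and clicks the first attractive item. The exploration constant 1.5 and the $\log t$ confidence level are common illustrative choices, not necessarily the paper's exact parameters.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def ucb1_index(p_hat, pulls, t):
    """UCB1 index: empirical mean plus a sqrt-log confidence radius."""
    return p_hat + math.sqrt(1.5 * math.log(t) / pulls)

def kl_ucb_index(p_hat, pulls, t, tol=1e-6):
    """KL-UCB index: the largest q >= p_hat with pulls * kl(p_hat, q) <= log t,
    found by bisection over [p_hat, 1]."""
    target = math.log(t) / pulls
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

def rank_top_k(indices, K):
    """Both algorithms recommend the K items with the largest indices."""
    return sorted(range(len(indices)), key=lambda i: -indices[i])[:K]

def update_from_cascade(stats, ranked, click_pos):
    """Cascading feedback: items before the click are observed as reward 0,
    the clicked item as reward 1, and items after the click are unobserved.
    stats maps item -> (pulls, successes); click_pos is None if no click."""
    last = click_pos if click_pos is not None else len(ranked) - 1
    for pos in range(last + 1):
        item = ranked[pos]
        n, s = stats[item]
        stats[item] = (n + 1, s + (1 if pos == click_pos else 0))
```

The only difference between the two algorithms in this sketch is which index function feeds `rank_top_k`; the abstract's point is that this seemingly minor choice changes the minimax regret by a $\sqrt{K}$ factor in the cascading setting.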
