Contextual Bandits with Cross-learning

In the classical contextual bandits problem, in each round t, a learner observes some context c, chooses some action a to perform, and receives some reward r_{a,t}(c). We consider the variant of this problem where in addition to receiving the reward r_{a,t}(c), the learner also learns the values of r_{a,t}(c') for all other contexts c'; i.e., the rewards that would have been achieved by performing that action under different contexts. This variant arises in several strategic settings, such as learning how to bid in non-truthful repeated auctions (in this setting the context is the decision maker's private valuation for each auction). We call this problem the contextual bandits problem with cross-learning. The best algorithms for the classical contextual bandits problem achieve Õ(√(CKT)) regret against all stationary policies, where C is the number of contexts, K the number of actions, and T the number of rounds. We demonstrate algorithms for the contextual bandits problem with cross-learning that remove the dependence on C and achieve regret Õ(√(KT)) (when contexts are stochastic with known distribution), Õ(K^{1/3}T^{2/3}) (when contexts are stochastic with unknown distribution), and Õ(√(KT)) (when contexts are adversarial but rewards are stochastic).
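To make the cross-learning feedback model concrete, here is a minimal sketch (not the paper's algorithm) of a UCB-style learner that exploits it: after pulling an action in the current context, it records the revealed rewards for every context, so the per-action pull count is shared across all contexts. The function name `cross_learning_ucb` and the `reward_fn(action, context, rng)` interface are assumptions for illustration.

```python
import math
import random

def cross_learning_ucb(contexts, actions, reward_fn, T, seed=0):
    """Illustrative sketch: UCB with cross-learning across contexts.

    reward_fn(a, c, rng) is an assumed interface returning the reward in
    [0, 1] for action a under context c. Contexts arrive stochastically
    (uniform here, for simplicity).
    """
    rng = random.Random(seed)
    # Per (action, context) reward sums; one pull count per action, since
    # cross-learning reveals the reward for EVERY context on each pull.
    sums = {(a, c): 0.0 for a in actions for c in contexts}
    pulls = {a: 0 for a in actions}
    total_reward = 0.0
    for t in range(1, T + 1):
        ctx = rng.choice(contexts)  # stochastic context for this round

        def index(a):
            # Standard UCB index; unexplored actions are pulled first.
            if pulls[a] == 0:
                return float("inf")
            mean = sums[(a, ctx)] / pulls[a]
            return mean + math.sqrt(2.0 * math.log(t) / pulls[a])

        a = max(actions, key=index)
        # Cross-learning: one reward per context is revealed this round.
        revealed = {c: reward_fn(a, c, rng) for c in contexts}
        total_reward += revealed[ctx]  # only the current context pays off
        for c in contexts:
            sums[(a, c)] += revealed[c]
        pulls[a] += 1  # shared across contexts: no dependence on C here
    return total_reward
```

Because every context's estimate is refreshed on every pull, the exploration cost is paid once per action rather than once per (action, context) pair, which is the intuition behind removing the dependence on C.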