Restless Linear Bandits

Abstract

A more general formulation of the linear bandit problem is considered to allow for dependencies over time. Specifically, it is assumed that there exists an unknown $\mathbb{R}^d$-valued stationary $\varphi$-mixing sequence of parameters $(\theta_t,~t \in \mathbb{N})$ which gives rise to pay-offs. This instance of the problem can be viewed as a generalization of both the classical linear bandits with iid noise, and the finite-armed restless bandits. In light of the well-known computational hardness of optimal policies for restless bandits, an approximation is proposed whose error is shown to be controlled by the $\varphi$-dependence between consecutive $\theta_t$. An optimistic algorithm, called LinMix-UCB, is proposed for the case where $\theta_t$ has an exponential mixing rate. The proposed algorithm is shown to incur a sub-linear regret of $\mathcal{O}\left(\sqrt{d n\,\mathrm{polylog}(n)}\right)$ with respect to an oracle that always plays a multiple of $\mathbb{E}\theta_t$. The main challenge in this setting is to ensure that the exploration-exploitation strategy is robust against long-range dependencies. The proposed method relies on Berbee's coupling lemma to carefully select near-independent samples and construct confidence ellipsoids around empirical estimates of $\mathbb{E}\theta_t$.
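
To make the last idea concrete, the following is a minimal sketch, not the paper's LinMix-UCB. It assumes that "near-independent samples" can be illustrated by keeping only observations that are at least `gap` steps apart, then fitting a ridge estimate and scoring an action with a standard ellipsoidal optimistic index. The names `spaced_indices`, `ridge_estimate`, `ucb`, and the parameters `gap`, `lam`, and `beta` are all hypothetical; the abstract does not specify how the spacing or the confidence radius is chosen.

```python
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def spaced_indices(n, gap):
    """Rounds 0, gap, 2*gap, ... below n (assumed spacing rule, for illustration)."""
    return list(range(0, n, gap))


def ridge_estimate(xs, ys, lam=1.0):
    """Regularised least squares: returns (theta_hat, V) with V = lam*I + sum x x^T."""
    d = len(xs[0])
    V = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, y in zip(xs, ys):
        for i in range(d):
            b[i] += x[i] * y
            for j in range(d):
                V[i][j] += x[i] * x[j]
    return solve(V, b), V


def ucb(x, theta_hat, V, beta):
    """Optimistic index <x, theta_hat> + beta * sqrt(x^T V^{-1} x)."""
    v_inv_x = solve(V, x)
    width = math.sqrt(sum(a * c for a, c in zip(x, v_inv_x)))
    return sum(a * c for a, c in zip(x, theta_hat)) + beta * width
```

In use, one would pass only the actions and pay-offs at `spaced_indices(n, gap)` to `ridge_estimate`, then play the action maximising `ucb`. A larger `gap` weakens the dependence between retained samples at the cost of discarding more data.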
