Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit

Abstract

We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic arms, and the reward of each arm $s \in \{1, \ldots, d\}$ follows an unknown distribution with mean $\mu_s$. In each time step, the player pulls a single arm and observes its reward. The player's goal is to identify the optimal \emph{action} $\boldsymbol{\pi}^{*} = \argmax_{\boldsymbol{\pi} \in \mathcal{A}} \boldsymbol{\mu}^{\top}\boldsymbol{\pi}$ from a finite \emph{action set} $\mathcal{A} \subset \mathbb{R}^{d}$ with as few arm pulls as possible. Previous methods for R-CPE-MAB assume that the size of the action set $\mathcal{A}$ is polynomial in $d$. We introduce the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, the first algorithm that works even when the size of the action set is exponentially large in $d$. We also introduce a novel problem-dependent sample complexity lower bound for R-CPE-MAB, and show that GenTS-Explore achieves the optimal sample complexity up to a problem-dependent constant factor.
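To make the problem setup concrete, here is a minimal Python sketch of a Thompson-sampling-style exploration loop for R-CPE-MAB. It illustrates the setting only and is not the paper's GenTS-Explore algorithm, whose details are not given in the abstract; the Gaussian posterior model, the stopping rule, and the names `pull`, `oracle`, `budget`, and `n_samples` are all assumptions of this sketch. The one structural point it mirrors is that the action set $\mathcal{A}$ is accessed only through a maximization oracle, so the loop never enumerates $\mathcal{A}$, which may be exponentially large in $d$.

```python
import numpy as np

def ts_explore(pull, oracle, d, budget=10_000, n_samples=20, rng=None):
    """Illustrative Thompson-sampling pure-exploration loop for R-CPE-MAB.

    pull(s)       -> one noisy reward of arm s in {0, ..., d-1}
    oracle(theta) -> the action pi in A maximizing theta @ pi, returned as
                     a length-d vector; accessing A only via this oracle is
                     what allows exponentially large action sets.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(d)
    sums = np.zeros(d)

    # Pull each arm once so every posterior is proper.
    for s in range(d):
        sums[s] += pull(s)
        counts[s] += 1

    for _ in range(budget - d):
        means = sums / counts
        stds = 1.0 / np.sqrt(counts)  # unit-variance Gaussian assumption

        best = oracle(means)          # empirically best action
        # Draw posterior samples of mu; look for a sample whose optimal
        # action disagrees with the empirical best.
        challenger = None
        for _ in range(n_samples):
            theta = rng.normal(means, stds)
            cand = oracle(theta)
            if not np.array_equal(cand, best):
                challenger = cand
                break
        if challenger is None:
            return best               # all samples agree: stop early

        # Pull the arm where the challenger differs most from the current
        # best, weighted by posterior uncertainty (illustrative choice).
        s = int(np.argmax(np.abs(challenger - best) * stds))
        sums[s] += pull(s)
        counts[s] += 1

    return oracle(sums / counts)      # budget exhausted
```

As a usage example, if $\mathcal{A}$ is the set of indicator vectors of size-$k$ subsets (top-$k$ arm identification), `oracle(theta)` simply returns the indicator vector of the $k$ largest coordinates of `theta`, so the sketch runs without ever materializing the $\binom{d}{k}$ actions.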
