Combinatorial Pure Exploration with Partial or Full-Bandit Linear Feedback
- OffRL

In this paper, we propose the novel model of combinatorial pure exploration with partial linear feedback (CPE-PL). In CPE-PL, given a combinatorial action space $\mathcal{X} \subseteq \{0,1\}^d$, in each round $t$ a learner chooses one action $x_t \in \mathcal{X}$ to play, obtains a random (possibly nonlinear) reward related to $x_t$ and an unknown latent vector $\theta \in \mathbb{R}^d$, and observes a partial linear feedback $y_t = M_{x_t}(\theta + \eta_t)$, where $\eta_t$ is a zero-mean noise vector and $M_{x_t}$ is a transformation matrix for $x_t$. The objective is to identify the optimal action with the maximum expected reward using as few rounds as possible. We also study an important subproblem of CPE-PL, combinatorial pure exploration with full-bandit feedback (CPE-BL), in which the learner observes full-bandit feedback (i.e., $M_{x_t} = x_t^{\top}$) and gains a linear expected reward $x_t^{\top}\theta$ after each play. We first propose a polynomial-time algorithmic framework for the general CPE-PL problem with a novel sample complexity analysis. Then, we propose an adaptive algorithm dedicated to the subproblem CPE-BL with better sample complexity. Our work provides a novel polynomial-time solution that simultaneously addresses limited feedback, general reward functions, and combinatorial action spaces including matroids, matchings, and $s$-$t$ paths.
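To make the feedback model concrete, here is a minimal simulation sketch (not the paper's code) of one round of CPE-PL: the learner plays an action $x$, the environment holds an unknown latent vector $\theta$, and the observation is $M_x(\theta + \eta)$. The specific transformation matrix, reward function, and action set below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 4
theta = rng.normal(size=latent_dim)          # unknown latent vector theta

# Example combinatorial action space: 0/1 incidence vectors of size-2 subsets.
actions = [np.array(a) for a in ([1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1])]

def play_round(x, full_bandit=False):
    """Return (reward, feedback) for action x under the CPE-PL model (illustrative)."""
    noise = rng.normal(scale=0.1, size=latent_dim)        # zero-mean noise eta_t
    if full_bandit:
        M_x = x.reshape(1, -1)                            # CPE-BL: M_x = x^T
        reward = float(x @ theta)                         # linear expected reward x^T theta
    else:
        M_x = np.diag(x)                                  # one possible partial-feedback transformation
        reward = float(np.maximum(x * theta, 0.0).sum())  # a (possibly nonlinear) reward, assumed here
    feedback = M_x @ (theta + noise)                      # partial linear feedback y_t = M_x (theta + eta_t)
    return reward, feedback

reward, y = play_round(actions[0], full_bandit=True)
print(reward, y)
```

In the full-bandit case (CPE-BL) the feedback collapses to a single scalar per round, which is what makes the identification problem harder than semi-bandit settings and motivates the dedicated adaptive algorithm described in the abstract.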