
Combinatorial Pure Exploration with Partial or Full-Bandit Linear Feedback

Abstract

In this paper, we propose the novel model of combinatorial pure exploration with partial linear feedback (CPE-PL). In CPE-PL, given a combinatorial action space $\mathcal{X} \subseteq \{0,1\}^d$, in each round a learner chooses one action $x \in \mathcal{X}$ to play, obtains a random (possibly nonlinear) reward related to $x$ and an unknown latent vector $\theta \in \mathbb{R}^d$, and observes a partial linear feedback $M_x(\theta + \eta)$, where $\eta$ is a zero-mean noise vector and $M_x$ is a transformation matrix for $x$. The objective is to identify the optimal action with the maximum expected reward using as few rounds as possible. We also study an important subproblem of CPE-PL, combinatorial pure exploration with full-bandit feedback (CPE-BL), in which the learner observes full-bandit feedback (i.e., $M_x = x^{\top}$) and gains a linear expected reward $x^{\top}\theta$ after each play. We first propose a polynomial-time algorithmic framework for the general CPE-PL problem with a novel sample complexity analysis. We then propose an adaptive algorithm dedicated to the subproblem CPE-BL with better sample complexity. Our work provides a novel polynomial-time solution that simultaneously addresses limited feedback, general reward functions, and combinatorial action spaces, including matroids, matchings, and $s$-$t$ paths.
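To make the observation model concrete, below is a minimal Python sketch of one round of CPE-PL specialized to the full-bandit case (CPE-BL), where $M_x = x^{\top}$ and the learner only sees the aggregated scalar $x^{\top}(\theta + \eta)$. This is not the paper's algorithm; the class and function names (`CPEEnvironment`, `step`) and the Gaussian noise choice are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): one round of the CPE-PL
# observation model with full-bandit feedback, i.e. M_x = x^T.
# Names and the Gaussian noise model are assumptions for this example.

class CPEEnvironment:
    def __init__(self, theta, noise_std=1.0, seed=0):
        self.theta = np.asarray(theta, dtype=float)   # unknown latent vector in R^d
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def step(self, x):
        """Play a binary action x in {0,1}^d and return the full-bandit
        observation x^T (theta + eta), where eta is zero-mean noise."""
        eta = self.rng.normal(0.0, self.noise_std, size=self.theta.shape)
        return float(np.asarray(x, dtype=float) @ (self.theta + eta))

# Example with d = 4: actions are indicator vectors of feasible subsets
# (e.g., bases of a matroid); the learner observes only a noisy scalar.
env = CPEEnvironment(theta=[0.8, 0.1, 0.5, 0.3])
x = np.array([1, 0, 1, 0])            # one feasible combinatorial action
observation = env.step(x)             # noisy scalar x^T (theta + eta)
expected_reward = x @ env.theta       # x^T theta = 1.3 for this action
print(observation, expected_reward)
```

In the general CPE-PL setting, the scalar `x @ (theta + eta)` would be replaced by a matrix-vector product `M_x @ (theta + eta)` with an action-dependent transformation matrix `M_x`, and the reward need not be linear in $\theta$.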
