
Instance-optimal PAC Algorithms for Contextual Bandits

Abstract

In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(\epsilon,\delta)$-\textit{PAC} setting: given a policy class $\Pi$, the goal of the learner is to return a policy $\pi \in \Pi$ whose expected reward is within $\epsilon$ of the optimal policy with probability greater than $1-\delta$. We characterize the first \textit{instance-dependent} PAC sample complexity of contextual bandits through a quantity $\rho_{\Pi}$, and provide matching upper and lower bounds in terms of $\rho_{\Pi}$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to an argmax oracle.
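To make the $(\epsilon,\delta)$-PAC objective concrete, the following is a minimal sketch in Python of a *naive* baseline, not the paper's algorithm: it explores uniformly, then selects the empirically best policy from a small policy class, mimicking an argmax-oracle call. The context distribution, reward means, and policy class here are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical setup: three contexts, two arms, and the policy class Pi
# containing every mapping from context to arm (2^3 = 8 policies).
CONTEXTS = [0, 1, 2]
ARMS = [0, 1]
POLICIES = [(a0, a1, a2) for a0 in ARMS for a1 in ARMS for a2 in ARMS]

# Illustrative Bernoulli reward means per (context, arm) pair.
MEANS = {(0, 0): 0.9, (0, 1): 0.2,
         (1, 0): 0.4, (1, 1): 0.6,
         (2, 0): 0.5, (2, 1): 0.5}

def reward(context, arm, rng):
    """Draw a Bernoulli reward with mean MEANS[(context, arm)]."""
    return 1.0 if rng.random() < MEANS[(context, arm)] else 0.0

def pac_best_policy(n_rounds, rng):
    """Naive PAC baseline: explore (context, arm) pairs uniformly,
    then return the policy in Pi with the best empirical reward.
    A large enough n_rounds makes the output epsilon-optimal with
    probability at least 1 - delta (uniform-exploration rate, not
    the instance-dependent rate characterized in the paper)."""
    totals = {(c, a): 0.0 for c in CONTEXTS for a in ARMS}
    counts = {(c, a): 0 for c in CONTEXTS for a in ARMS}
    for _ in range(n_rounds):
        c = rng.choice(CONTEXTS)   # context drawn i.i.d.
        a = rng.choice(ARMS)       # uniform exploration over arms
        totals[(c, a)] += reward(c, a, rng)
        counts[(c, a)] += 1
    est = {k: totals[k] / max(counts[k], 1) for k in totals}
    # "Argmax oracle" call: maximize estimated expected reward
    # (contexts are uniform here, so an unweighted sum suffices).
    return max(POLICIES, key=lambda pi: sum(est[(c, pi[c])] for c in CONTEXTS))

rng = random.Random(0)
print(pac_best_policy(20_000, rng))
```

With roughly 3,300 samples per (context, arm) pair, the empirical means concentrate well within the gaps in contexts 0 and 1, so the returned policy plays arm 0 in context 0 and arm 1 in context 1; context 2 is a tie, so either arm is $\epsilon$-optimal there. The paper's contribution is precisely to replace this worst-case uniform exploration with an instance-optimal sampling scheme whose complexity scales with $\rho_{\Pi}$.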
