
Infinite Arms Bandit: Optimality via Confidence Bounds

Abstract

The infinite arms bandit problem was initiated by Berry et al. (1997). They derived a regret lower bound for all strategies under Bernoulli rewards with uniform priors, and proposed bandit strategies based on success runs, which however do not achieve this bound. Bonald and Proutière (2013) showed that the lower bound is achieved by their two-target algorithm, and extended optimality to Bernoulli rewards with general priors. We propose here a confidence bound target (CBT) algorithm that achieves optimality for unspecified non-negative reward distributions. For each arm we use the mean and standard deviation of its rewards to compute a confidence bound, and we play the arm with the smallest confidence bound provided it is smaller than a target mean. If the bounds are all larger, then we play a new arm. We show, for a given prior of the arm means, how the target mean can be computed to achieve optimality. In the absence of information on the prior, the target mean is determined empirically, and the regret achieved is still comparable to the regret lower bound. Numerical studies show that CBT is versatile and outperforms its competitors.
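The arm-selection rule described in the abstract can be sketched in Python. This is only an illustrative sketch: the exact confidence-bound formula, the width constant c, the treatment of arm means as quantities to be minimized, and the arm_stats bookkeeping are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def cbt_select(arm_stats, target_mean, c=1.0):
    """Illustrative CBT-style selection rule (sketch, not the paper's exact algorithm).

    arm_stats: list of (mean, std, n_plays) tuples for arms already sampled.
    target_mean: threshold below which an arm is considered good enough to keep playing.
    c: assumed confidence-bound width multiplier (not specified in the abstract).

    Returns the index of the arm to play next, or None to signal
    "draw a new arm from the infinite pool".
    Smaller arm means are treated as better, matching the abstract's
    "smallest confidence bound" phrasing.
    """
    best_idx, best_bound = None, np.inf
    for i, (mean, std, n) in enumerate(arm_stats):
        # Confidence bound on the arm's mean built from its own sample
        # mean and standard deviation.
        bound = mean - c * std / np.sqrt(n)
        if bound < best_bound:
            best_idx, best_bound = i, bound
    # Play the most promising existing arm only if its bound beats the
    # target mean; otherwise explore a new arm.
    return best_idx if best_bound < target_mean else None
```

A driver loop would call cbt_select after each reward, updating the chosen arm's running mean, standard deviation, and play count, and appending a fresh (0, 0, 0)-style entry whenever None is returned and a new arm is drawn.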
