
Asymptotic Behavior of Minimal-Exploration Allocation Policies: Almost Sure, Arbitrarily Slow Growing Regret

Abstract

The purpose of this paper is to provide further insight into the structure of the sequential allocation ("stochastic multi-armed bandit", or MAB) problem by establishing probability-one finite-horizon bounds and convergence rates for the sample (or "pseudo") regret associated with two simple classes of allocation policies $\pi$. For any slowly increasing function $g$, subject to mild regularity constraints, we construct two policies (the $g$-Forcing and the $g$-Inflated Sample Mean) that achieve a measure of regret of order $O(g(n))$ almost surely as $n \to \infty$, bounded from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function $g$ effectively controls the "exploration" of the classical "exploration/exploitation" tradeoff.
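The abstract does not spell out the policies, but the $g$-Forcing idea suggests a simple shape: force exploration rounds at a rate governed by $g$, and exploit the empirical best arm otherwise. Below is a minimal Python sketch under that assumption; the schedule g(n) = log(n + 1), the uniform choice of forced arm, and all function names are illustrative guesses, not the paper's definitions.

```python
import math
import random

def g(n):
    # Hypothetical slowly increasing exploration schedule; the paper
    # allows any g satisfying mild regularity constraints.
    return math.log(n + 1)

def g_forcing_policy(arms, horizon, seed=0):
    """Sketch of a g-Forcing-style allocation policy (assumed form):
    force an exploration round whenever the number of forced pulls so
    far lags g(n); otherwise pull the arm with the highest sample mean."""
    rng = random.Random(seed)
    counts = [0] * len(arms)
    means = [0.0] * len(arms)
    forced = 0  # forced-exploration rounds taken so far
    for n in range(1, horizon + 1):
        if forced < g(n):
            i = rng.randrange(len(arms))  # forced uniform exploration
            forced += 1
        else:
            i = max(range(len(arms)), key=lambda k: means[k])  # exploit
        reward = arms[i]()  # pull arm i, observe a reward
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # running mean update
    return means, counts

# Example: two Bernoulli arms with means 0.4 and 0.6.
arms = [lambda: float(random.random() < 0.4),
        lambda: float(random.random() < 0.6)]
means, counts = g_forcing_policy(arms, horizon=10_000)
```

With g(n) = log(n + 1) the policy spends only O(log n) rounds exploring, matching the abstract's claim that g controls the exploration side of the tradeoff and that regret grows as $O(g(n))$.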
