
Asymptotic Behavior of Minimal-Exploration Allocation Policies: Almost Sure, Arbitrarily Slow Growing Regret

Abstract

Consider the problem of sampling sequentially from a finite number $N \ge 2$ of populations, or `bandits', where each population $i$ is specified by a sequence of random variables $\{ X^i_k \}_{k \geq 1}$, with $X^i_k$ representing the reward received the $k^{th}$ time population $i$ is sampled. For each $i$, the $\{ X^i_k \}_{k \geq 1}$ are taken to be i.i.d. random variables with finite mean. For any slowly increasing function $g$, subject to mild regularity constraints, we construct two policies (the $g$-Forcing and the $g$-Inflated Sample Mean) that achieve a measure of regret of order $O(g(n))$ almost surely as $n \to \infty$. Additionally, asymptotic probability-one bounds on the remainder term are established. In the constructions herein, the function $g$ effectively controls the `exploration' side of the classical exploration/exploitation tradeoff.
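The forcing idea described in the abstract can be illustrated with a minimal sketch: at each round, any arm whose sample count has fallen below the schedule $g(n)$ is force-sampled (exploration); otherwise the arm with the best empirical mean is played (exploitation). This is a hedged illustration of the general forcing scheme, not the paper's exact construction; the Bernoulli arms, the default schedule $g(n) = \log(n+1)$, and the function name `g_forcing` are assumptions for the example.

```python
import math
import random

def g_forcing(arms, horizon, g=lambda n: math.log(n + 1)):
    """Simulate a generic g-forcing bandit policy on Bernoulli arms.

    arms:    list of success probabilities, one per population.
    horizon: number of rounds to play.
    g:       slowly increasing exploration schedule; the default
             g(n) = log(n + 1) is a hypothetical choice for illustration.
    Returns the total reward collected.
    """
    counts = [0] * len(arms)    # times each arm has been sampled
    means = [0.0] * len(arms)   # running sample mean of each arm
    total = 0.0
    for n in range(1, horizon + 1):
        # Forced exploration: any arm undersampled relative to g(n).
        under = [i for i, c in enumerate(counts) if c < g(n)]
        if under:
            i = random.choice(under)
        else:
            # Exploitation: play the arm with the highest sample mean.
            i = max(range(len(arms)), key=lambda j: means[j])
        reward = 1.0 if random.random() < arms[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean
        total += reward
    return total
```

Because $g$ grows slowly, the fraction of forced (exploratory) pulls vanishes, so the per-round cost of exploration is asymptotically negligible while every arm is still sampled infinitely often.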
