Asymptotic Behavior of Minimal-Exploration Allocation Policies: Almost
Sure, Arbitrarily Slow Growing Regret
Consider the problem of sampling sequentially from a finite number of populations, or `bandits', where each population $i$ is specified by a sequence of random variables $\{ X^i_k \}_{k \geq 1}$, with $X^i_k$ representing the reward received the $k$-th time population $i$ is sampled. For each $i$, the $\{ X^i_k \}_{k \geq 1}$ are taken to be i.i.d. random variables with finite mean. For any slowly increasing function $g$, subject to mild regularity constraints, we construct two policies (the $g$-Forcing, and the $g$-Inflated Sample Mean) that achieve a measure of regret of order $O(g(n))$ almost surely as $n \to \infty$. Additionally, asymptotic probability one bounds on the remainder term are established. In the constructions herein, the function $g$ effectively controls the `exploration' of the classical `exploration/exploitation' tradeoff.