Allocating Divisible Resources on Arms with Unknown and Random Rewards

We consider a decision maker allocating one unit of renewable and divisible resource in each period on a number of arms. The arms have unknown and random rewards whose means are proportional to the allocated resource and whose variances are proportional to an order $b$ of the allocated resource. In particular, if the decision maker allocates resource $A_i$ to arm $i$ in a period, then the reward $Y_i$ is $Y_i(A_i) = A_i \mu_i + A_i^b \xi_i$, where $\mu_i$ is the unknown mean and the noise $\xi_i$ is independent and sub-Gaussian. When the order $b$ ranges from 0 to 1, the framework smoothly bridges the standard stochastic multi-armed bandit and online learning with full feedback. We design two algorithms that attain the optimal gap-dependent and gap-independent regret bounds for $b \in [0, 1]$, and demonstrate a phase transition at $b = 1/2$. The theoretical results hinge on a novel concentration inequality we have developed that bounds a linear combination of sub-Gaussian random variables whose weights are fractional, adapted to the filtration, and monotonic.
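The reward model described in the abstract can be illustrated with a minimal simulation sketch. It assumes the reward of an arm takes the form `a * mu + a**b * noise`, where `a` is the allocated resource, `mu` the arm's mean, and `b` the order; Gaussian noise stands in as one sub-Gaussian choice. The function name `sample_rewards` and the parameters `allocation`, `mu`, `b`, `rng`, and `sigma` are illustrative and do not come from the paper.

```python
import random


def sample_rewards(allocation, mu, b, rng, sigma=1.0):
    """Draw one period of rewards under the assumed model a*mu + a**b * noise.

    allocation: nonnegative amounts per arm, summing to at most one unit.
    mu:         per-arm means (unknown to the learner, known to the simulator).
    b:          order linking the noise scale to the allocation, in [0, 1].
    rng:        a random.Random instance, passed in for reproducibility.
    sigma:      noise standard deviation (Gaussian noise is an assumption).
    """
    if len(allocation) != len(mu):
        raise ValueError("allocation and mu must have the same length")
    if any(a < 0 for a in allocation) or sum(allocation) > 1 + 1e-9:
        raise ValueError("allocation must be nonnegative and sum to at most 1")
    if not 0 <= b <= 1:
        raise ValueError("b must lie in [0, 1]")
    # Independent noise per arm; the mean scales with a, the noise with a**b.
    return [a * m + (a ** b) * rng.gauss(0.0, sigma)
            for a, m in zip(allocation, mu)]


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.2, 0.5, 0.9]          # hypothetical arm means
    split = [1 / 3, 1 / 3, 1 / 3]    # spread the unit resource evenly
    for order in (0.0, 0.5, 1.0):
        print(order, sample_rewards(split, means, order, rng))
```

Under this assumed form, dividing a reward by its allocation gives `mu + a**(b - 1) * noise`. At `b = 1` every arm with a positive allocation yields an equally noisy estimate of its mean, which resembles full feedback, whereas at `b = 0` small allocations give very noisy estimates, as in a standard bandit where only the pulled arm is informative.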