Allocating Divisible Resources on Arms with Unknown and Random Rewards

Abstract

We consider a decision maker allocating one unit of renewable and divisible resource in each period on a number of arms. The arms have unknown and random rewards whose means are proportional to the allocated resource and whose variances are proportional to an order $b$ of the allocated resource. In particular, if the decision maker allocates resource $A_i$ to arm $i$ in a period, then the reward $Y_i$ is $Y_i(A_i) = A_i \mu_i + A_i^b \xi_i$, where $\mu_i$ is the unknown mean and the noise $\xi_i$ is independent and sub-Gaussian. When the order $b$ ranges from 0 to 1, the framework smoothly bridges the standard stochastic multi-armed bandit and online learning with full feedback. We design two algorithms that attain the optimal gap-dependent and gap-independent regret bounds for $b \in [0,1]$, and demonstrate a phase transition at $b = 1/2$. The theoretical results hinge on a novel concentration inequality we have developed that bounds a linear combination of sub-Gaussian random variables whose weights are fractional, adapted to the filtration, and monotonic.
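
To make the reward model concrete, the following is a minimal simulation sketch of one period, not the algorithms proposed in the paper. The function name `sample_rewards`, the parameter `sigma`, the use of Gaussian noise as one particular sub-Gaussian choice, and the convention that an arm with zero allocation returns a reward of exactly zero are all assumptions made for illustration.

```python
import random


def sample_rewards(allocation, means, b, sigma=1.0, rng=None):
    """Draw one period of rewards Y_i = A_i * mu_i + A_i**b * xi_i.

    allocation: nonnegative fractions A_i that sum to one unit of resource.
    means:      the (normally unknown) arm means mu_i.
    b:          the order in [0, 1] that scales the noise.
    sigma:      noise scale; Gaussian noise is an assumed sub-Gaussian choice.
    """
    if rng is None:
        rng = random.Random()
    if len(allocation) != len(means):
        raise ValueError("allocation and means must have the same length")
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    if any(a < 0 for a in allocation) or abs(sum(allocation) - 1.0) > 1e-9:
        raise ValueError("allocation must be nonnegative and sum to 1")

    rewards = []
    for a, mu in zip(allocation, means):
        if a == 0:
            # Assumed convention: an arm that receives no resource
            # produces no reward and no observation.
            rewards.append(0.0)
            continue
        xi = rng.gauss(0.0, sigma)
        rewards.append(a * mu + (a ** b) * xi)
    return rewards


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.3, 0.5, 0.8]
    # b = 0: noise does not shrink with the allocation (bandit-like).
    print(sample_rewards([0.0, 0.0, 1.0], means, b=0.0, rng=rng))
    # b = 1: Y_i / A_i = mu_i + xi_i for every funded arm (full-feedback-like).
    print(sample_rewards([1 / 3, 1 / 3, 1 / 3], means, b=1.0, rng=rng))
```

In this sketch, dividing each reward by its allocation gives an estimate of $\mu_i$ with noise $A_i^{b-1} \xi_i$: at $b = 1$ every funded arm is observed equally well regardless of how thinly the resource is spread, while at $b = 0$ small allocations give very noisy estimates.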
