Decentralized Multi-Armed Bandit with Multiple Distributed Players
We consider a multi-armed bandit problem with multiple distributed players, where each player independently samples one of N stochastic processes (arms) with unknown parameters and accrues reward in each slot, without information exchange among players. Players choosing the same arm collide, and, depending on the collision model, either none or only one of them receives reward. This problem can be formulated as a decentralized multi-armed bandit problem. We measure the performance of a decentralized policy by the system regret, defined as the total reward loss with respect to the optimal performance in the ideal scenario where all arm parameters are known to all players and collisions among players are eliminated through perfect scheduling. We show that the minimum system regret grows with time at the same logarithmic order as in the centralized counterpart, where players exchange observations and make decisions jointly. We construct a decentralized policy that achieves this optimal order. Furthermore, we show that the proposed policy belongs to a general class of decentralized policies for which a uniform performance benchmark is established.
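As a concrete point of reference, the system regret admits the following standard formalization; the notation here is ours, assuming M players, N >= M arms, and mean rewards sorted in decreasing order as theta^(1) >= ... >= theta^(N):

```latex
R(T) \;=\; T \sum_{k=1}^{M} \theta^{(k)} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} Y(t)\right]
```

where Y(t) is the total system reward accrued in slot t. The first term is the perfect-scheduling benchmark in which the M players permanently occupy the M best arms without collision, and the main result states that the minimum achievable R(T) grows as Theta(log T).

The sketch below simulates this setting to make the collision model and the regret metric concrete. The per-player rule (UCB1 indices with a random target rank among the top M arms) is a naive illustrative stand-in, not the paper's order-optimal policy, and the assumption that players still observe the arm's state under collision follows a channel-sensing interpretation:

```python
import numpy as np

# Illustrative simulation of the decentralized setting: M players, N
# Bernoulli arms, no information exchange.  The per-player policy below
# (UCB1 with a random target rank) is a hypothetical stand-in, NOT the
# paper's proposed policy, and is not order-optimal.

rng = np.random.default_rng(0)
N, M, T = 5, 2, 20000                             # arms, players, horizon
theta = np.sort(rng.uniform(0.1, 0.9, N))[::-1]   # unknown arm means (best first)

counts = np.zeros((M, N))                         # per-player pull counts
means = np.zeros((M, N))                          # per-player sample means
total_reward = 0.0

for t in range(1, T + 1):
    choices = []
    for m in range(M):
        if t <= N:                                # initialization: sample each arm once
            choices.append((t - 1 + m) % N)
            continue
        ucb = means[m] + np.sqrt(2.0 * np.log(t) / counts[m])
        rank = rng.integers(M)                    # random rank among the M best indices
        choices.append(int(np.argsort(ucb)[::-1][rank]))
    for arm in set(choices):
        state = float(rng.random() < theta[arm])  # the arm's state this slot
        colliders = [m for m in range(M) if choices[m] == arm]
        if len(colliders) == 1:
            total_reward += state                 # sole player accrues the reward
        # collision model: reward is lost under collision, but (by assumption,
        # matching a channel-sensing interpretation) the state is still observed
        for m in colliders:
            counts[m, arm] += 1
            means[m, arm] += (state - means[m, arm]) / counts[m, arm]

benchmark = T * theta[:M].sum()                   # perfect-scheduling benchmark
print(f"system regret R(T) approx {benchmark - total_reward:.1f}")
```

Because the random-rank rule collides with constant probability per slot, its regret grows linearly; an order-optimal policy such as the one constructed in the paper would instead keep the printed regret at O(log T).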