
Fair Multi-Agent Bandits

Abstract

In this paper, we study the problem of fair multi-agent multi-armed bandit learning when agents do not communicate with each other, except for collision information, which is provided to agents accessing the same arm simultaneously. We provide an algorithm with regret $O\left(N^3 \log \frac{B}{\Delta} f(\log T) \log T\right)$ (assuming bounded rewards, with unknown bound), where $f(t)$ is any function diverging to infinity with $t$. This significantly improves on previous results, which had the same upper bound on the regret of order $O(f(\log T) \log T)$ but an exponential dependence on the number of agents. The result is attained by using a distributed auction algorithm to learn the sample-optimal matching and a novel order-statistics-based regret analysis. Simulation results present the dependence of the regret on $\log T$.
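
To make the communication model concrete, the following is a minimal sketch of one round of play in which the only shared signal is a collision flag. It is not the paper's algorithm. The function name `play_round`, the Bernoulli reward model, and the rule that colliding agents receive zero reward are all assumptions made for illustration; the abstract states only that collision information is provided to agents accessing the same arm simultaneously.

```python
import random
from collections import Counter


def play_round(choices, mean_rewards, rng):
    """Simulate one round of a multi-agent bandit with collision feedback.

    choices[i] is the arm picked by agent i.
    mean_rewards[i][k] is the (hypothetical) Bernoulli mean of arm k for agent i.
    Returns (rewards, collided), each with one entry per agent.

    ASSUMPTIONS (not from the paper): rewards are Bernoulli, and agents
    that collide on an arm receive zero reward in that round.
    """
    counts = Counter(choices)  # how many agents picked each arm
    rewards, collided = [], []
    for agent, arm in enumerate(choices):
        hit = counts[arm] > 1  # the only information shared between agents
        collided.append(hit)
        if hit:
            rewards.append(0.0)
        else:
            draw = rng.random() < mean_rewards[agent][arm]
            rewards.append(1.0 if draw else 0.0)
    return rewards, collided


if __name__ == "__main__":
    rng = random.Random(0)
    means = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.7]]  # 3 agents, 2 arms (made-up values)
    print(play_round([0, 0, 1], means, rng))
```

In this sketch each agent sees only its own reward and its own collision flag, so any coordination, such as the distributed auction mentioned above, would have to be built on those two signals alone.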
