
Online Learning with Feedback Graphs: The True Shape of Regret

Abstract

Sequential learning with feedback graphs is a natural extension of the multi-armed bandit problem in which the learner is given an underlying graph structure that provides additional information: playing an action reveals the losses of all of that action's neighbors. This problem was introduced by \citet{mannor2011} and has received considerable attention in recent years. It is generally stated in the literature that the minimax regret rate for this problem is of order $\sqrt{\alpha T}$, where $\alpha$ is the independence number of the graph and $T$ is the time horizon. However, this rate is proven only when the number of rounds $T$ is larger than $\alpha^3$, which significantly restricts the usability of the result for large graphs. In this paper, we define a new quantity $R^*$, called the \emph{problem complexity}, and prove that the minimax regret is proportional to $R^*$ for any graph and time horizon $T$. Introducing an intricate exploration strategy, we define the \mainAlgorithm algorithm, which achieves the minimax optimal regret bound and is the first provably optimal algorithm for this setting, even when $T$ is smaller than $\alpha^3$.
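To make the observation model concrete, the following is a minimal sketch, not the paper's algorithm: a generic EXP3-style learner with importance-weighted loss estimates on an undirected feedback graph with self-loops. The function name `exp3_with_feedback_graph`, the learning rate `eta`, and the 5-cycle demo (independence number $\alpha = 2$) are illustrative assumptions, not anything specified in the paper.

```python
import math
import random

def exp3_with_feedback_graph(neighbors, losses, T, eta=0.1):
    """Generic EXP3-style learner under graph feedback (illustrative only).

    neighbors[i]: set of arms whose losses are revealed when arm i is played
                  (includes i itself, i.e., the graph has self-loops).
    losses(t):    function returning the list of losses in [0, 1] at round t.
    """
    K = len(neighbors)
    weights = [1.0] * K
    total_loss = 0.0
    for t in range(T):
        z = sum(weights)
        p = [w / z for w in weights]
        arm = random.choices(range(K), weights=p)[0]
        loss = losses(t)
        total_loss += loss[arm]
        for i in neighbors[arm]:  # losses revealed by the played action
            # Probability that arm i was observed: the chance of playing
            # any arm whose neighborhood contains i.
            q = sum(p[j] for j in range(K) if i in neighbors[j])
            est = loss[i] / q  # importance-weighted loss estimate
            weights[i] *= math.exp(-eta * est)
        m = max(weights)  # renormalize to avoid numerical underflow
        weights = [w / m for w in weights]
    return total_loss

# Demo: 5-cycle graph (independence number alpha = 2), fixed losses,
# arm 2 is the best action.
neighbors = {i: {(i - 1) % 5, i, (i + 1) % 5} for i in range(5)}
fixed_losses = lambda t: [0.9, 0.9, 0.1, 0.9, 0.9]
print(exp3_with_feedback_graph(neighbors, fixed_losses, T=1000))
```

Here each play reveals three losses (the arm and its two cycle neighbors), so the effective exploration cost is governed by the graph's independence number rather than the number of arms.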
