
Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback

Operations Research (OR), 2021
Abstract

We consider online no-regret learning in unknown games with bandit feedback, where each player observes only its realized reward at each time step (determined by all players' current joint action) rather than its gradient. We focus on the class of smooth and strongly monotone games and study optimal no-regret learning therein. Leveraging self-concordant barrier functions, we first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $\tilde{\Theta}(n\sqrt{T})$ under smooth and strongly concave reward functions ($n \geq 1$ is the problem dimension). We then show that if each player applies this no-regret learning algorithm in strongly monotone games, the joint action converges in the last iterate to the unique Nash equilibrium at a rate of $\tilde{\Theta}(\sqrt{\frac{n^2}{T}})$. Prior to our work, the best-known convergence rate in the same class of games was $\tilde{O}(\sqrt[3]{\frac{n^2}{T}})$ (achieved by a different algorithm), leaving open the problem of optimal no-regret learning algorithms (the known lower bound is $\Omega(\sqrt{\frac{n^2}{T}})$). Our results settle this open problem and contribute to the broad landscape of bandit game-theoretic learning by identifying the first doubly optimal bandit learning algorithm, in that it achieves (up to log factors) both the optimal regret in single-agent learning and the optimal last-iterate convergence rate in multi-agent learning. We also present results from several application studies (Cournot competition, Kelly auctions, and distributed regularized logistic regression) to demonstrate the efficacy of our algorithm.
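To make the single-agent ingredient concrete, the sketch below illustrates the general recipe the abstract alludes to: one-point bandit gradient ascent in which the exploration perturbation is shaped by the Hessian of a self-concordant (log-)barrier, so that sampled actions stay inside the feasible set's Dikin ellipsoid. This is a minimal illustration, not the paper's exact algorithm; the box constraint, the schedules `eta_t` and `delta_t`, the clipping step, and all function names are assumptions made for the example (the paper's method uses a barrier-regularized mirror-descent-style update).

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm): one-point bandit
# gradient ascent over a box [lo, hi]^n, with exploration shaped by the
# Hessian of the self-concordant log-barrier so perturbed plays stay feasible.

def barrier_hessian_diag(x, lo, hi):
    """Diagonal of the Hessian of -sum log(x - lo) - sum log(hi - x)."""
    return 1.0 / (x - lo) ** 2 + 1.0 / (hi - x) ** 2

def bandit_gradient_ascent(reward_fn, x0, lo, hi, T, eta0=0.1, delta0=0.1):
    """Single-agent bandit learning: only scalar reward values are observed."""
    n = x0.size
    x = x0.copy()
    for t in range(1, T + 1):
        eta_t = eta0 / t                # step size suited to strong concavity
        delta_t = delta0 / np.sqrt(t)   # shrinking exploration radius (assumed schedule)
        h = barrier_hessian_diag(x, lo, hi)        # A_t = diag(h), barrier Hessian
        u = np.random.randn(n)
        u /= np.linalg.norm(u)                     # uniform direction on the unit sphere
        x_play = x + delta_t * u / np.sqrt(h)      # perturb within the Dikin ellipsoid
        r = reward_fn(x_play)                      # bandit feedback: a single scalar
        g = (n / delta_t) * r * np.sqrt(h) * u     # one-point gradient estimate
        x = x + eta_t * g                          # simplified gradient step
        x = np.clip(x, lo + 2 * delta_t, hi - 2 * delta_t)  # stay strictly interior
    return x

# Usage: strongly concave quadratic reward on [0, 1]^n with unknown maximizer.
rng = np.random.default_rng(0)
n = 5
x_star = rng.uniform(0.3, 0.7, n)
reward = lambda a: -np.sum((a - x_star) ** 2)
x_final = bandit_gradient_ascent(reward, x0=np.full(n, 0.5), lo=0.0, hi=1.0, T=20000)
print(np.round(x_final, 3), np.round(x_star, 3))
```

In the multi-agent setting described in the abstract, each player would run such a procedure on its own action variable, with the scalar reward at each round determined by the joint perturbed action of all players.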
