
Efficient Near-Optimal Algorithm for Online Shortest Paths in Directed Acyclic Graphs with Bandit Feedback Against Adaptive Adversaries

Annual Conference Computational Learning Theory (COLT), 2025
Main: 11 pages · Appendix: 34 pages · Bibliography: 3 pages · 8 figures · 2 tables
Abstract

In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG G = (V, E) with a source node v_s and a sink node v_t, let X ⊆ {0,1}^{|E|} denote the set of all paths from v_s to v_t. At each round t, we select a path x_t ∈ X and receive bandit feedback on our loss ⟨x_t, y_t⟩ ∈ [−1, 1], where y_t is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over T rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of Õ(√(|E| T log |X|)) with high probability against any adaptive adversary, where Õ(·) hides logarithmic factors in the number of edges |E|. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for m-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.
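To make the interaction protocol concrete, here is a minimal sketch of the problem setting: Exp3 with importance-weighted loss estimates run over an explicit enumeration of all s–t paths. This naive baseline is exponentially inefficient in general (the paper's contribution is precisely an efficient algorithm with a stronger high-probability guarantee) and is not the authors' method; the example DAG, loss sequence, and all function names below are illustrative assumptions.

```python
import math
import random

def enumerate_paths(adj, s, t):
    """DFS enumeration of all s->t paths in a DAG, each as a frozenset of edges."""
    paths = []
    def dfs(u, edges):
        if u == t:
            paths.append(frozenset(edges))
            return
        for v in adj.get(u, []):
            dfs(v, edges + [(u, v)])
    dfs(s, [])
    return paths

def exp3_shortest_path(adj, s, t, losses, eta, seed=0):
    """Naive bandit online shortest path via Exp3 over the explicit path set.

    losses: one dict per round mapping edge -> loss; path loss is <x_t, y_t>.
    Returns (total loss incurred, best path's cumulative loss in hindsight).
    """
    paths = enumerate_paths(adj, s, t)
    K = len(paths)
    w = [0.0] * K          # log-weights of the exponential-weights distribution
    cum = [0.0] * K        # full-information cumulative losses (for regret only)
    total = 0.0
    rng = random.Random(seed)
    for y in losses:
        m = max(w)  # subtract max log-weight for numerical stability
        probs = [math.exp(wi - m) for wi in w]
        Z = sum(probs)
        probs = [p / Z for p in probs]
        i = rng.choices(range(K), weights=probs)[0]
        # Bandit feedback: only the scalar loss of the chosen path is observed.
        loss_i = sum(y[e] for e in paths[i])
        total += loss_i
        # Importance-weighted estimator: unbiased, credited to the chosen path.
        w[i] -= eta * loss_i / probs[i]
        for k in range(K):
            cum[k] += sum(y[e] for e in paths[k])
    return total, min(cum)
```

Enumerating X defeats the purpose on large DAGs, where |X| can be exponential in |E|; the point of the sketch is only to pin down the protocol (choose a path, observe one scalar, update) that the paper's efficient algorithm implements without ever listing the paths.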
