
Adversarial Combinatorial Semi-bandits with Graph Feedback

Abstract

In combinatorial semi-bandits, a learner repeatedly selects from a combinatorial decision set of arms, receives the realized sum of rewards, and observes the rewards of the individual selected arms as feedback. In this paper, we extend this framework to include \emph{graph feedback}, where the learner observes the rewards of all neighboring arms of the selected arms in a feedback graph $G$. We establish that the optimal regret over a time horizon $T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, where $S$ is the size of the combinatorial decisions and $\alpha$ is the independence number of $G$. This result interpolates between the known regrets $\widetilde{\Theta}(S\sqrt{T})$ under full information (i.e., $G$ is complete) and $\widetilde{\Theta}(\sqrt{KST})$ under semi-bandit feedback (i.e., $G$ has only self-loops), where $K$ is the total number of arms. A key technical ingredient is to realize a convexified action using a random decision vector with negative correlations. We also show that online stochastic mirror descent (OSMD), which only realizes convexified actions in expectation, is suboptimal.
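The "random decision vector with negative correlations" ingredient can be illustrated with systematic sampling, a classic dependent-rounding scheme: given a fractional vector $x \in [0,1]^K$ with $\sum_i x_i = S$, it outputs a 0/1 vector whose coordinatewise marginals equal $x$, whose support always has exactly $S$ arms, and whose coordinates are negatively correlated. This is a sketch of one standard scheme of this kind, not necessarily the exact construction used in the paper; the function name below is illustrative.

```python
import math
import random

def systematic_sample(x):
    """Round a fractional vector x (entries in [0, 1], integer sum S)
    to a 0/1 vector z with P(z[i] = 1) = x[i] and sum(z) = S always.

    Lay the intervals [0, x[0]), [x[0], x[0]+x[1]), ... end to end on
    [0, S), then select arm i iff its interval contains a threshold
    u + k for the single random shift u ~ Uniform[0, 1) and some
    integer k. Sharing one shift across all thresholds is what makes
    distinct coordinates negatively correlated.
    """
    u = random.random()
    z = [0] * len(x)
    c = 0.0  # running cumulative sum: left endpoint of arm i's interval
    for i, xi in enumerate(x):
        lo, hi = c, c + xi
        # Smallest integer k with u + k >= lo; since xi <= 1, at most
        # one threshold can land in [lo, hi).
        k = math.ceil(lo - u)
        if u + k < hi:
            z[i] = 1
        c = hi
    return z
```

For example, with `x = [0.5, 0.5, 1.0]` the output always contains exactly two arms, arm 2 is always selected (its marginal is 1), and arms 0 and 1 are perfectly negatively correlated: exactly one of them appears in every draw.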

@article{wen2025_2502.18826,
  title={Adversarial Combinatorial Semi-bandits with Graph Feedback},
  author={Wen, Yuxiao},
  journal={arXiv preprint arXiv:2502.18826},
  year={2025}
}