
A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback

Abstract

We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback, which in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i}\log(T) + \frac{d}{\Delta_i\log K}\right) + dK^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. We also present an extension of the algorithm to the case of arbitrary delays, which is based on oracle knowledge of the maximal delay $d_{max}$ and achieves $\mathcal{O}(\sqrt{TK} + \sqrt{D\log K} + d_{max}K^{1/3}\log K)$ regret in the adversarial regime, where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i}\log(T) + \frac{\sigma_{max}}{\Delta_i\log K}\right) + d_{max}K^{1/3}\log K\right)$ regret in the stochastic regime, where $\sigma_{max}$ is the maximal number of outstanding observations. Finally, we present a lower bound that matches the regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.
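The algorithm of Zimmert and Seldin [2020] that the abstract builds on is based on follow-the-regularized-leader (FTRL) over the probability simplex with a hybrid regularizer (1/2-Tsallis entropy plus negative entropy), importance-weighted loss estimates that arrive after a delay, and a skipping rule that discards overly delayed rounds. The sketch below is only an illustrative rendering of that template, not the paper's algorithm or its modified tuning: the learning-rate schedules, the `skip_threshold` parameter, and the function names (`ftrl_distribution`, `delayed_bandit_ftrl`) are assumptions, and the FTRL step is solved numerically with SciPy rather than in closed form.

```python
# Illustrative sketch of FTRL for bandits with delayed feedback.
# Learning-rate schedules and the skipping threshold are placeholders,
# NOT the tuning analyzed in the paper.
import numpy as np
from scipy.optimize import minimize


def ftrl_distribution(cum_loss_est, eta, gamma):
    """Numerically solve the FTRL step over the simplex with a hybrid
    regularizer: -(2/eta) * sum_i sqrt(x_i) + (1/gamma) * sum_i x_i log x_i."""
    K = len(cum_loss_est)

    def objective(x):
        x = np.clip(x, 1e-12, 1.0)
        return (x @ cum_loss_est
                - 2.0 / eta * np.sum(np.sqrt(x))
                + 1.0 / gamma * np.sum(x * np.log(x)))

    res = minimize(
        objective,
        np.full(K, 1.0 / K),
        method="SLSQP",
        bounds=[(1e-9, 1.0)] * K,
        constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
    )
    x = np.clip(res.x, 1e-12, None)
    return x / x.sum()


def delayed_bandit_ftrl(loss_matrix, delays, skip_threshold, seed=0):
    """Run the sketch on a T x K loss matrix; the loss of round t is
    revealed only at round t + delays[t]."""
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    cum_loss_est = np.zeros(K)
    pending = []  # (arrival round, arm, importance-weighted loss estimate)
    played = []

    for t in range(T):
        # Placeholder learning rates; the paper derives specific schedules.
        eta = 1.0 / np.sqrt(t + 1)
        gamma = 1.0 / np.sqrt((t + 1) * max(1, int(np.mean(delays[: t + 1]))))

        x = ftrl_distribution(cum_loss_est, eta, gamma)
        arm = rng.choice(K, p=x)
        played.append(arm)

        # Importance-weighted estimate of the incurred loss, revealed later.
        est = loss_matrix[t, arm] / x[arm]
        if delays[t] <= skip_threshold:  # skipping rule: drop overly delayed rounds
            pending.append((t + delays[t], arm, est))

        # Incorporate every estimate whose delay has elapsed.
        arrived = [p for p in pending if p[0] <= t]
        pending = [p for p in pending if p[0] > t]
        for _, a, e in arrived:
            cum_loss_est[a] += e

    return np.array(played)


if __name__ == "__main__":
    T, K = 1000, 5
    rng = np.random.default_rng(1)
    losses = rng.uniform(size=(T, K))
    losses[:, 0] -= 0.2  # make arm 0 slightly better on average
    delays = rng.integers(0, 50, size=T)
    arms = delayed_bandit_ftrl(np.clip(losses, 0, 1), delays, skip_threshold=200)
    print("fraction of pulls on arm 0:", np.mean(arms == 0))
```

In this toy run the skipping rule is inactive (all delays are below the threshold); its role in the paper is to control the contribution of rare, very long delays to the total delay term $D$ in the adversarial bound.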
