arXiv: 2110.13400

Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays

26 October 2021
Jiatai Huang
Yan Dai
Longbo Huang
Abstract

We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting losses can fall in a general bounded interval $[-L, L]$ that is unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose an approach named Scale-Free Delayed INF (SFD-INF) for this novel setting, which combines a recent "convex combination trick" with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first, SFD-TINF, uses the $\frac{1}{2}$-Tsallis entropy regularizer and achieves $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega((\sqrt{KT}+\sqrt{D\log K})L)$ lower bound when $K$ is regarded as a constant independent of $T$. The second, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KDL})$, which reduces to the $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret in the worst case and is thus more general than SFD-TINF, at the cost of a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.
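
To make the "nearly matches" claim concrete, the following is a minimal Python sketch that evaluates the leading terms of the SFD-TINF upper bound and of the lower bound quoted above. It is not part of the paper: the function names (`upper_bound`, `lower_bound`, `bound_ratio`) are hypothetical, and all hidden constants and logarithmic factors in the $\widetilde{\mathcal O}$ and $\Omega$ notation are assumed to be 1, so only the scaling in $K$, $T$, $D$ and $L$ is meaningful.

```python
import math


def upper_bound(K, T, D, L):
    """Leading term sqrt(K (D + T)) * L of the stated upper bound.

    Assumption: constants and log factors hidden by O-tilde are dropped.
    """
    return math.sqrt(K * (D + T)) * L


def lower_bound(K, T, D, L):
    """Leading term (sqrt(K T) + sqrt(D log K)) * L of the stated lower bound.

    Assumption: the constant hidden by Omega is dropped.
    """
    return (math.sqrt(K * T) + math.sqrt(D * math.log(K))) * L


def bound_ratio(K, T, D, L=1.0):
    """Ratio of the two leading terms; the loss scale L cancels out."""
    return upper_bound(K, T, D, L) / lower_bound(K, T, D, L)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for D in (0, 10_000, 1_000_000):
        print(f"D={D}: ratio={bound_ratio(K=10, T=10_000, D=D):.3f}")
```

With $D=0$ the two leading terms coincide, and as $D$ grows with $K$ and $T$ fixed their ratio tends to $\sqrt{K/\log K}$, which depends on $K$ alone. This is why the abstract qualifies the match with "when $K$ is regarded as a constant".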
