Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

28 April 2023
Hao Qin, Kwang-Sung Jun, Chicheng Zhang
Abstract

We study $K$-armed bandit problems where the reward distributions of the arms are all supported on the $[0,1]$ interval. Designing regret-efficient randomized exploration algorithms in this setting has been a challenge. Maillard sampling \cite{maillard13apprentissage}, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting \cite{bian2022maillard} while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling that achieves a KL-style gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the expected reward of the optimal arm and $T$ is the time horizon.
