
arXiv:1803.04623v5

Thompson Sampling for Combinatorial Semi-Bandits

13 March 2018
Siwei Wang
Wei Chen
Abstract

We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB and obtain the first distribution-dependent regret bound of $O(mK_{\max}\log T / \Delta_{\min})$, where $m$ is the number of arms, $K_{\max}$ is the size of the largest super arm, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and that of any non-optimal solution. We also show that one cannot directly replace the exact offline oracle with an approximation oracle in the TS algorithm, even for the classical MAB problem. We then extend the analysis to two special cases: the linear reward case and the matroid bandit case. When the reward function is linear, the regret of the TS algorithm achieves the better bound $O(m\sqrt{K_{\max}}\log T / \Delta_{\min})$. For matroid bandits, we can remove the independence assumption across arms and achieve a regret upper bound that matches the lower bound for the matroid case. Finally, we present experiments comparing the regret of TS with that of existing algorithms such as CUCB and ESCB.
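To make the setting concrete, below is a minimal simulation sketch of TS for a combinatorial semi-bandit, assuming Bernoulli base arms with independent Beta(1, 1) priors and the simplest exact offline oracle, top-K selection (a uniform-matroid instance of CMAB with linear reward). The problem instance, parameter values, and names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch of Thompson sampling for a combinatorial semi-bandit.
# Assumptions (not from the paper): Bernoulli base arms, Beta(1, 1) priors,
# linear reward, and a top-K exact oracle (uniform-matroid super arms).

rng = np.random.default_rng(0)

m, K, T = 10, 3, 5000                  # arms, super-arm size, horizon
true_means = rng.uniform(0.1, 0.9, m)  # unknown arm means (hypothetical instance)

alpha = np.ones(m)  # Beta posterior: 1 + observed successes per arm
beta = np.ones(m)   # Beta posterior: 1 + observed failures per arm

opt_reward = np.sort(true_means)[-K:].sum()  # best achievable expected reward
regret = 0.0

for t in range(T):
    # 1. Sample a mean estimate for every base arm from its posterior.
    theta = rng.beta(alpha, beta)
    # 2. Exact offline oracle: the super arm maximizing the sampled linear
    #    reward is simply the K arms with the largest sampled means.
    super_arm = np.argsort(theta)[-K:]
    # 3. Play the super arm; semi-bandit feedback reveals each chosen
    #    arm's individual Bernoulli outcome.
    outcomes = rng.random(K) < true_means[super_arm]
    # 4. Update the posterior of every observed base arm.
    alpha[super_arm] += outcomes
    beta[super_arm] += ~outcomes
    regret += opt_reward - true_means[super_arm].sum()

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Under this instance the cumulative regret grows logarithmically in $T$, consistent with the distribution-dependent bounds stated above; swapping the exact top-K oracle for an approximation oracle breaks the analysis, as the abstract notes.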
