arXiv:2410.12713

How Does Variance Shape the Regret in Contextual Bandits?

16 October 2024
Zeyu Jia
Jian Qian
Alexander Rakhlin
Chen-Yu Wei
Abstract

We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax bounds, we show that the eluder dimension $d_\text{elu}$, a complexity measure of the function class, plays a crucial role in variance-dependent bounds. We consider two types of adversary: (1) Weak adversary: the adversary sets the reward variance before observing the learner's action. In this setting, we prove that a regret of $\Omega(\sqrt{\min\{A, d_\text{elu}\}\Lambda} + d_\text{elu})$ is unavoidable when $d_\text{elu} \leq \sqrt{AT}$, where $A$ is the number of actions, $T$ is the total number of rounds, and $\Lambda$ is the total variance over $T$ rounds. For the $A \leq d_\text{elu}$ regime, we derive a nearly matching upper bound $\tilde{O}(\sqrt{A\Lambda} + d_\text{elu})$ for the special case where the variance is revealed at the beginning of each round. (2) Strong adversary: the adversary sets the reward variance after observing the learner's action. We show that a regret of $\Omega(\sqrt{d_\text{elu}\Lambda} + d_\text{elu})$ is unavoidable when $\sqrt{d_\text{elu}\Lambda} + d_\text{elu} \leq \sqrt{AT}$. In this setting, we provide an upper bound of order $\tilde{O}(d_\text{elu}\sqrt{\Lambda} + d_\text{elu})$. Furthermore, we examine the setting where the function class additionally provides distributional information of the reward, as studied by Wang et al. (2024). We demonstrate that the regret bound $\tilde{O}(\sqrt{d_\text{elu}\Lambda} + d_\text{elu})$ established in their work is unimprovable when $\sqrt{d_\text{elu}\Lambda} + d_\text{elu} \leq \sqrt{AT}$. However, with a slightly different definition of the total variance and with the assumption that the reward follows a Gaussian distribution, one can achieve a regret of $\tilde{O}(\sqrt{A\Lambda} + d_\text{elu})$.
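
To make the interaction protocol concrete, below is a minimal, hypothetical simulation sketch (not taken from the paper) contrasting the two adversary types: in both cases the learner observes a context and picks an action, but the weak adversary must commit to the per-action reward variances before the action is chosen, while the strong adversary may choose the variance after seeing it. The linear mean-reward model, the uniformly random learner, and the class names `WeakAdversary` / `StrongAdversary` are illustrative assumptions, not the paper's algorithm or lower-bound construction.

```python
# Illustrative sketch only: a toy interaction protocol for realizable
# contextual bandits with a variance-setting adversary. The modelling
# choices below (linear mean rewards, uniform learner) are assumptions
# for illustration.
import numpy as np

rng = np.random.default_rng(0)

A, T, d = 5, 1000, 3                      # actions, rounds, context dimension
theta_star = rng.normal(size=(A, d))      # true parameters; f*(x, a) = <theta_a, x>


class WeakAdversary:
    """Commits to per-action reward variances BEFORE the learner acts."""

    def choose_variances(self, context):
        return rng.uniform(0.0, 0.1, size=A)


class StrongAdversary:
    """Chooses the reward variance AFTER seeing the learner's action."""

    def choose_variance(self, context, action):
        return rng.uniform(0.0, 0.1)


def run(adversary, weak=True):
    total_variance = 0.0                  # Lambda: total variance over T rounds
    regret = 0.0
    for _ in range(T):
        x = rng.normal(size=d)            # context for this round
        means = theta_star @ x            # true mean reward of each action
        a = rng.integers(A)               # placeholder learner: uniform play
        if weak:
            sigma2 = adversary.choose_variances(x)[a]   # variance fixed pre-action
        else:
            sigma2 = adversary.choose_variance(x, a)    # variance picked post-action
        reward = means[a] + rng.normal(scale=np.sqrt(sigma2))  # observed reward
        total_variance += sigma2
        regret += means.max() - means[a]  # pseudo-regret against the best action
    return regret, total_variance


print(run(WeakAdversary(), weak=True))
print(run(StrongAdversary(), weak=False))
```

Under either protocol, the quantity accumulated in `total_variance` plays the role of the total variance $\Lambda$ appearing in the bounds above.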
