Variance-aware robust reinforcement learning with linear function approximation under heavy-tailed rewards

9 March 2023
Xiang Li
Qiang Sun
Abstract

This paper presents two algorithms, AdaOFUL and VARA, for online sequential decision-making in the presence of heavy-tailed rewards with only finite variances. For linear stochastic bandits, we address the issue of heavy-tailed rewards by modifying the adaptive Huber regression and proposing AdaOFUL. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{O}\big(d\big(\sum_{t=1}^T \nu_t^2\big)^{1/2} + d\big)$ as if the rewards were uniformly bounded, where $\nu_t^2$ is the observed conditional variance of the reward at round $t$, $d$ is the feature dimension, and $\widetilde{O}(\cdot)$ hides logarithmic dependence. Building upon AdaOFUL, we propose VARA for linear MDPs, which achieves a tighter variance-aware regret bound of $\widetilde{O}(d\sqrt{H G^* K})$. Here, $H$ is the length of episodes, $K$ is the number of episodes, and $G^*$ is a smaller instance-dependent quantity that can be bounded by other instance-dependent quantities when additional structural conditions on the MDP are satisfied. Our regret bound is superior to the current state-of-the-art bounds in three ways: (1) it depends on a tighter instance-dependent quantity and has optimal dependence on $d$ and $H$; (2) we can obtain further instance-dependent bounds on $G^*$ under additional structural conditions on the MDP; and (3) our regret bound is valid even when rewards have only finite variances, achieving a level of generality unmatched by previous works. Overall, our modified adaptive Huber regression algorithm may serve as a useful building block in the design of algorithms for online problems with heavy-tailed rewards.
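The Huber regression underlying AdaOFUL replaces the squared loss of least squares with the Huber loss, which grows only quadratically for small residuals and linearly for large ones, so a few extreme heavy-tailed rewards cannot dominate the estimate. The sketch below is a minimal, non-adaptive illustration of that idea, not the paper's AdaOFUL algorithm: the robustification threshold `tau` is held fixed here, whereas the paper adapts it to the observed conditional variances $\nu_t^2$, and all names (`huber_regression`, `tau`, `reg`) are illustrative choices rather than the paper's notation.

```python
# Minimal sketch: ridge-regularized Huber regression for linear models
# with heavy-tailed noise. Illustrative only; AdaOFUL adapts the
# threshold per round, which this fixed-tau version does not.
import numpy as np
from scipy.optimize import minimize

def huber_loss(residual, tau):
    """Huber loss: quadratic for |r| <= tau, linear beyond."""
    r = np.abs(residual)
    return np.where(r <= tau, 0.5 * r**2, tau * r - 0.5 * tau**2)

def huber_regression(X, y, tau=1.0, reg=1.0):
    """Estimate theta by minimizing the regularized Huber objective."""
    d = X.shape[1]
    def objective(theta):
        return huber_loss(y - X @ theta, tau).sum() + 0.5 * reg * theta @ theta
    result = minimize(objective, np.zeros(d), method="L-BFGS-B")
    return result.x

# Usage: Student-t noise with df = 2.5 is heavy-tailed yet has finite
# variance, matching the paper's reward assumption.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + rng.standard_t(df=2.5, size=500)
theta_hat = huber_regression(X, y, tau=2.0)
print(np.linalg.norm(theta_hat - theta_true))
```

On this kind of data, plain least squares can be pulled far off by a handful of outliers, while the linear tail of the Huber loss caps each observation's influence; choosing `tau` trades statistical bias against robustness, which is exactly the knob the paper's adaptive scheme tunes from the observed variances.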
