arXiv:2410.19919

Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces

25 October 2024
Avik Kar
Rahul Singh
Abstract

We study infinite-horizon average-reward reinforcement learning (RL) for Lipschitz MDPs and develop an algorithm ZoRL that discretizes the state-action space adaptively and zooms into promising regions of the state-action space. We show that its regret can be bounded as $\mathcal{\tilde{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = 2d_\mathcal{S} + d_z + 3$, $d_\mathcal{S}$ is the dimension of the state space, and $d_z$ is the zooming dimension. $d_z$ is a problem-dependent quantity, which allows us to conclude that if the MDP is benign, then its regret will be small. We note that the existing notion of zooming dimension for average-reward RL is defined in terms of policy coverings, and hence it can be huge when the policy class is rich even though the underlying MDP is simple, so that the regret upper bound is nearly $O(T)$. The zooming dimension proposed in the current work is bounded above by $d$, the dimension of the state-action space, and hence is truly adaptive, i.e., it shows how to capture adaptivity gains for infinite-horizon average-reward RL. ZoRL outperforms other state-of-the-art algorithms in experiments, thereby demonstrating the gains arising from adaptivity.
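
As an illustration of how the bound scales (the dimensions below are hypothetical and not taken from the paper), suppose the state space has dimension $d_\mathcal{S} = 2$ and the state-action space has dimension $d = 4$. In the worst case the zooming dimension equals $d$, whereas a benign MDP may admit a much smaller $d_z$:

$d_z = 4$: $\; d_{\text{eff.}} = 2 \cdot 2 + 4 + 3 = 11$, so the regret bound is $\mathcal{\tilde{O}}\big(T^{10/11}\big)$.
$d_z = 1$: $\; d_{\text{eff.}} = 2 \cdot 2 + 1 + 3 = 8$, so the regret bound is $\mathcal{\tilde{O}}\big(T^{7/8}\big)$.

A smaller zooming dimension thus yields a smaller exponent and hence lower regret, which is the adaptivity gain the abstract refers to.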
