Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

28 June 2023
Zihan Zhang
Qiaomin Xie
Abstract

We develop several provably efficient model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov Decision Processes (MDPs). We consider both the online setting and the setting with access to a simulator. In the online setting, we propose model-free RL algorithms based on reference-advantage decomposition. Our algorithm achieves $\widetilde{O}(S^5 A^2\,\mathrm{sp}(h^*)\sqrt{T})$ regret after $T$ steps, where $S \times A$ is the size of the state-action space and $\mathrm{sp}(h^*)$ is the span of the optimal bias function. Our results are the first to achieve optimal dependence on $T$ for weakly communicating MDPs. In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}^2(h^*)}{\epsilon^2} + \frac{S^2 A\,\mathrm{sp}(h^*)}{\epsilon}\right)$ samples, whereas the minimax lower bound is $\Omega\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$. Our results are based on two new techniques that are unique in the average-reward setting: 1) better discounted approximation by value-difference estimation; 2) efficient construction of a confidence region for the optimal bias function with space complexity $O(SA)$.
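
The simulator-setting result builds on the standard idea of approximating an average-reward MDP by a discounted one whose effective horizon scales like $\mathrm{sp}(h^*)/\epsilon$. As a rough illustration only, and not the authors' algorithm, the sketch below runs synchronous tabular Q-learning on a toy simulator with discount factor $\gamma = 1 - \epsilon/\mathrm{sp}(h^*)$; the toy MDP, the sample function, and the constants span_h and epsilon are assumptions made purely for the example.

# Minimal sketch (assumptions only): discounted approximation of an average-reward MDP,
# solved by synchronous tabular Q-learning with simulator access. This illustrates the
# generic reduction, not the sharper algorithm or analysis of the paper.
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP standing in for the simulator.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])   # P[s, a, s']: transition probabilities
R = np.array([[1.0, 0.0],
              [0.5, 0.8]])                   # R[s, a]: expected rewards
S, A = R.shape

def sample(s, a):
    """Simulator access: draw one transition and return (reward, next_state)."""
    s_next = rng.choice(S, p=P[s, a])
    return R[s, a], s_next

epsilon = 0.1                      # target accuracy for the average reward (assumed)
span_h = 1.0                       # assumed upper bound on sp(h*), the span of the optimal bias
gamma = 1.0 - epsilon / span_h     # discount factor: effective horizon ~ sp(h*)/epsilon

Q = np.zeros((S, A))
n_iters = 20000
for t in range(1, n_iters + 1):
    alpha = 1.0 / (1.0 + (1.0 - gamma) * t)   # polynomially decaying step size
    for s in range(S):
        for a in range(A):
            r, s_next = sample(s, a)
            target = r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])

policy = Q.argmax(axis=1)
# (1 - gamma) * max_a Q(s, a) approximates the optimal average reward up to O(epsilon).
print("greedy policy:", policy, "approx gain:", (1 - gamma) * Q.max(axis=1))

A naive analysis of this reduction pays extra factors in the effective horizon; the paper's value-difference estimation and the $O(SA)$-space confidence region for the optimal bias function are what sharpen the sample complexity toward the $\Omega\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ lower bound.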
