Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

12 June 2019
Zihan Zhang
Xiangyang Ji
arXiv:1906.05110
Abstract

We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which efficiently learns Reinforcement Learning (RL) problems modeled by a Markov decision process (MDP) with a finite state-action space. By evaluating the state-pair difference of the optimal bias function $h^*$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$ (where $\tilde{O}$ denotes $O$ with logarithmic factors ignored) for an MDP with $S$ states and $A$ actions, in the case that an upper bound $H$ on the span of $h^*$, i.e., $sp(h^*)$, is known. This result outperforms the best previous regret bound $\tilde{O}(S\sqrt{AHT})$ \citep{fruit2019improved} by a factor of $\sqrt{S}$, and matches the lower bound of $\Omega(\sqrt{SAHT})$ \citep{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near-optimal regret bound of $\tilde{O}(\sqrt{SADT})$ for MDPs with a finite diameter $D$, compared to the lower bound of $\Omega(\sqrt{SADT})$ \citep{jaksch2010near}.
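For reference, the quantities appearing in these bounds can be written out using the standard average-reward conventions. The definitions below are a sketch of the usual setting (rewards assumed to lie in $[0,1]$); they are not quoted from the paper itself. The regret of an algorithm over $T$ steps in an MDP $M$ is

$\mathrm{Regret}(T) = T\,\rho^*(M) - \sum_{t=1}^{T} r_t$,

where $\rho^*(M)$ is the optimal average reward (gain) and $r_t$ is the reward collected at step $t$. The span of the optimal bias function is

$sp(h^*) = \max_{s} h^*(s) - \min_{s} h^*(s)$.

Under the same assumptions, a communicating MDP with diameter $D$ satisfies $sp(h^*) \le D$, which is why taking $H = D$ in the $\tilde{O}(\sqrt{SAHT})$ bound yields the $\tilde{O}(\sqrt{SADT})$ bound stated above.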
