Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

21 December 2021
Tianhao Wu, Yunchang Yang, Han Zhong, Liwei Wang, S. Du, Jiantao Jiao
arXiv:2112.10935
Abstract

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, the theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result for policy-based methods in \citet{shani2020optimistic} is only $\tilde{O}(\sqrt{S^2AH^4K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes; this leaves a $\sqrt{SH}$ gap compared with the information-theoretic lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$. To bridge this gap, we propose a novel algorithm, Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT), which features the property "Stable at Any Time". We prove that our algorithm achieves $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret. When $S > H$, our algorithm is minimax optimal up to logarithmic factors. To the best of our knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.
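
As a quick sanity check (not part of the paper's abstract, only a derivation from the two bounds quoted above), the role of the condition $S > H$ can be made explicit:

\[
  S > H \;\Longrightarrow\; \sqrt{AH^4K} = \sqrt{AH^3K}\cdot\sqrt{H} < \sqrt{AH^3K}\cdot\sqrt{S} = \sqrt{SAH^3K},
\]

so the regret $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ collapses to $\tilde{O}(\sqrt{SAH^3K})$, matching the lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$ up to logarithmic factors. For comparison, the earlier bound $\tilde{O}(\sqrt{S^2AH^4K})$ exceeds the lower bound by a factor of $\sqrt{S^2AH^4K}/\sqrt{SAH^3K} = \sqrt{SH}$, which is the gap mentioned in the abstract.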
