arXiv:1906.10462
Policy Optimization with Stochastic Mirror Descent

25 June 2019
Long Yang
Yu Zhang
Gang Zheng
Qian Zheng
Pengfei Li
Jianhang Huang
Jun Wen
Gang Pan
Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the $\mathtt{VRMPO}$ algorithm: a sample-efficient policy gradient method based on stochastic mirror descent. In $\mathtt{VRMPO}$, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best-known sample complexity for policy optimization. Extensive experimental results demonstrate that $\mathtt{VRMPO}$ outperforms state-of-the-art policy gradient methods in various settings.
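
The abstract names two generic ingredients: a mirror descent update on the policy parameters and a recursive variance-reduced gradient estimator. The sketch below illustrates those building blocks in isolation, assuming a $p$-norm mirror map and a SPIDER/SARAH-style recursive estimator; it is not the authors' VRMPO implementation, and the function names are hypothetical.

```python
# Illustrative sketch (NOT the paper's VRMPO code): a stochastic mirror
# descent step with the mirror map psi(x) = 0.5 * ||x||_p^2, and a
# SPIDER/SARAH-style recursive variance-reduced gradient estimator.
import numpy as np


def mirror_descent_step(theta, grad, lr, p=2.0):
    """One mirror descent update of the parameter vector theta.

    For p = 2 both maps below are the identity and the update reduces to
    plain SGD; p in (1, 2) gives a genuinely non-Euclidean update.
    """
    q = p / (p - 1.0)  # dual exponent, 1/p + 1/q = 1

    def grad_half_norm_sq(x, r):
        # gradient of 0.5 * ||x||_r^2
        n = np.linalg.norm(x, ord=r)
        if n == 0.0:
            return np.zeros_like(x)
        return np.sign(x) * np.abs(x) ** (r - 1) * n ** (2 - r)

    dual = grad_half_norm_sq(theta, p) - lr * grad  # gradient step in dual space
    return grad_half_norm_sq(dual, q)               # map back to primal space


def recursive_vr_estimator(v_prev, grad_new, grad_old):
    """Recursive variance-reduced estimator (SPIDER/SARAH style):

        v_t = g(theta_t; xi_t) - g(theta_{t-1}; xi_t) + v_{t-1},

    where both gradients are evaluated on the same fresh batch of
    trajectories xi_t, so correlated sampling noise largely cancels.
    """
    return grad_new - grad_old + v_prev
```

As a sanity check, with the default p = 2 the mirror step degenerates to ordinary SGD: `mirror_descent_step(theta, g, lr)` returns `theta - lr * g`.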
