
Policy Optimization with Stochastic Mirror Descent

AAAI Conference on Artificial Intelligence (AAAI), 2019
Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. In this paper, we propose $\mathtt{VRMPO}$: a sample-efficient policy gradient method based on stochastic mirror descent. A novel variance-reduced policy gradient estimator is the key to the improved sample efficiency of $\mathtt{VRMPO}$. Our $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best-known sample complexity. We conduct extensive experiments showing that our algorithm outperforms state-of-the-art policy gradient methods in various settings.
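The abstract names two ingredients: a variance-reduced policy gradient estimator and a stochastic mirror descent update. As a rough illustration only, the sketch below combines a SARAH/SPIDER-style recursive variance-reduced estimator with a stochastic mirror ascent step on a toy quadratic surrogate objective. The objective, the mirror map, and all hyperparameters are illustrative assumptions, not the authors' implementation or the precise $\mathtt{VRMPO}$ update.

```python
# Minimal sketch: SARAH/SPIDER-style variance reduction + stochastic mirror
# ascent on a toy surrogate objective. Everything here is an illustrative
# assumption; it is NOT the authors' VRMPO implementation.
import numpy as np

rng = np.random.default_rng(0)
dim = 5
theta_star = np.ones(dim)  # maximizer of the toy surrogate objective


def batch_grad(theta, z_batch):
    # Per-sample gradient of the toy objective J(theta) ~ -0.5||theta - theta_star||^2,
    # averaged over a batch; the noise z stands in for trajectory randomness.
    return -(theta - theta_star) + z_batch.mean(axis=0)


# Mirror map psi(x) = 0.5||x||^2: grad_psi and its inverse are both the
# identity, so the step below reduces to plain gradient ascent. Swapping in
# another strongly convex psi changes only these two functions.
def grad_psi(x):
    return x


def grad_psi_star(y):
    return y


alpha = 0.1              # step size (illustrative)
epochs = 20              # number of outer checkpoints
inner = 10               # inner recursive steps per checkpoint
big_B, small_b = 64, 8   # checkpoint / correction batch sizes

theta = np.zeros(dim)
for _ in range(epochs):
    # Checkpoint: large-batch gradient estimate.
    v = batch_grad(theta, rng.normal(size=(big_B, dim)))
    for _ in range(inner):
        theta_prev = theta
        # Stochastic mirror ascent step with the current estimator v.
        theta = grad_psi_star(grad_psi(theta) + alpha * v)
        # SARAH-style recursive correction: evaluate BOTH iterates on the
        # SAME small batch so the shared noise cancels. (In the policy
        # setting, the same trajectories would be reused via importance
        # weighting.)
        z = rng.normal(size=(small_b, dim))
        v = v + batch_grad(theta, z) - batch_grad(theta_prev, z)

print("theta after training:", np.round(theta, 3))  # approaches theta_star
```

With the Euclidean mirror map the update reduces to ordinary stochastic gradient ascent; the value of writing it in mirror descent form is that `grad_psi` and `grad_psi_star` can be swapped for other geometries without touching the rest of the loop.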
