Policy Optimization with Stochastic Mirror Descent
AAAI Conference on Artificial Intelligence (AAAI), 2019
Abstract
Improving sample efficiency has been a longstanding goal in reinforcement learning. In this paper, we propose a sample-efficient policy gradient method based on stochastic mirror descent. A novel variance-reduced policy gradient estimator is the key to its improved sample efficiency. Our method needs only O(ε^{-3}) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best-known sample complexity. We conduct extensive experiments showing that our algorithm outperforms state-of-the-art policy gradient methods in a variety of settings.
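For intuition about the mirror descent component: with the negative-entropy mirror map over a discrete action distribution, a stochastic mirror ascent step reduces to an exponentiated-gradient update. The sketch below is purely illustrative of that general update rule (with a plain stochastic gradient, not the paper's variance-reduced estimator), and all names in it are hypothetical.

```python
import numpy as np

def mirror_ascent_step(policy, grad, lr=0.1):
    """One mirror ascent step on a probability simplex using the
    negative-entropy mirror map (exponentiated gradient).

    Illustrative sketch only: `grad` stands in for a stochastic
    estimate of the policy gradient, not the paper's estimator.
    """
    # Map to the dual space (logits), take a gradient step there.
    logits = np.log(policy) + lr * grad
    # Map back to the simplex via softmax (subtract max for stability).
    shifted = np.exp(logits - logits.max())
    return shifted / shifted.sum()

# Uniform policy over 4 actions; the gradient favors action 0.
policy = np.ones(4) / 4
grad = np.array([1.0, 0.0, 0.0, 0.0])
new_policy = mirror_ascent_step(policy, grad)
```

After the step, `new_policy` remains a valid distribution and shifts probability mass toward the action with the larger gradient component.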
