Policy Optimization With Penalized Point Probability Distance: An
Alternative To Proximal Policy Optimization
This paper proposes a first-order gradient reinforcement learning algorithm that can be seen as a variant of Trust Region Policy Optimization (TRPO). The method, which we call policy optimization with penalized point probability distance (POP3D), retains nearly all of the advantages of proximal policy optimization (PPO), such as easy implementation, fast learning, and strong final performance. Specifically, we propose a new unconstrained surrogate objective in which a point probability distance penalty prevents the update step from growing too large, while contributing to better exploration and stability than a Kullback-Leibler divergence penalty. Based on Gym Atari and MuJoCo experiments, we conclude that POP3D is a viable alternative to PPO: it achieves state-of-the-art results within 40 million frame steps on 49 Atari games and competitive scores in the continuous-control domain, according to two common metrics, final performance and learning speed. Moreover, we release the code on GitHub: https://github.com/cxxgtxy/POP3D.git.
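To make the penalized surrogate objective concrete, here is a minimal sketch in PyTorch of what a POP3D-style loss could look like, assuming the log-probabilities of the sampled actions under the current and old policies and the estimated advantages are available. The penalty coefficient `beta` and the exact scaling of the point probability distance are assumptions for illustration, not values taken from the paper.

```python
import torch

def pop3d_surrogate_loss(logp_new, logp_old, advantages, beta=5.0):
    """Sketch of a penalized point-probability-distance objective.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and behaviour policies; advantages: estimated advantages.
    beta is a hypothetical penalty coefficient.
    """
    # Importance ratio, as in the TRPO/PPO surrogate.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = ratio * advantages
    # Point probability distance: squared gap between the two policies'
    # probabilities at the sampled action, used here in place of a KL
    # penalty or a clipped ratio (illustrative form).
    point_prob_dist = (torch.exp(logp_new) - torch.exp(logp_old)) ** 2
    # Negate because optimizers minimize; the penalty discourages large steps.
    return -(surrogate - beta * point_prob_dist).mean()
```

Compared with PPO's clipped objective, a sketch like this keeps the objective unconstrained and differentiable everywhere, relying on the penalty term rather than clipping to limit the policy update.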