On- and Off-Policy Monotonic Policy Improvement

Abstract

Monotonic policy improvement and off-policy learning are two key desirable properties of reinforcement learning algorithms. In this study, we show that monotonic policy improvement is guaranteed when learning from a mixture of on-policy and off-policy data. Based on this theoretical result, we provide an algorithm that applies the experience replay technique to trust region policy optimization. The proposed method can be regarded as a variant of the off-policy natural policy gradient method.
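
As a rough illustration of what learning from mixed on- and off-policy data can look like, the sketch below keeps past samples in a replay buffer and forms a surrogate objective as a weighted average of importance-weighted advantages from fresh and replayed batches. This is not the paper's algorithm: the names `ReplayBuffer`, `surrogate`, `mixture_surrogate`, and the mixing weight `alpha` are hypothetical, the abstract does not state the form of the objective, and the trust region (KL) constraint is omitted.

```python
from collections import deque
import random


class ReplayBuffer:
    """Fixed-capacity store of past samples (hypothetical helper)."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest item when full
        self.storage = deque(maxlen=capacity)

    def add(self, sample):
        self.storage.append(sample)

    def sample(self, n, rng=random):
        # Draw up to n stored samples without replacement
        items = list(self.storage)
        return rng.sample(items, min(n, len(items)))


def surrogate(batch):
    """Mean of ratio * advantage over a batch of (ratio, advantage) pairs.

    ratio is assumed to be pi_new(a|s) / pi_behavior(a|s), where
    pi_behavior is the policy that generated the sample.
    """
    return sum(ratio * adv for ratio, adv in batch) / len(batch)


def mixture_surrogate(on_batch, off_batch, alpha):
    """Assumed convex combination of on- and off-policy surrogates."""
    return alpha * surrogate(on_batch) + (1.0 - alpha) * surrogate(off_batch)


# Usage: fresh samples have ratio 1.0 before any update; replayed
# samples carry ratios relative to the older policy that produced them.
buffer = ReplayBuffer(capacity=2)
for pair in [(0.9, 1.0), (0.5, 2.0), (2.0, 1.0)]:
    buffer.add(pair)

on_batch = [(1.0, 2.0), (1.0, 4.0)]
off_batch = list(buffer.storage)
objective = mixture_surrogate(on_batch, off_batch, alpha=0.5)
```

In a trust region method, an objective of this kind would be maximized subject to a bound on the divergence between the new and old policies; that step is left out here because the abstract gives no detail on it.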
