
Improving Policy Optimization via ε-Retrain

Main: 13 pages · Appendix: 4 pages · Bibliography: 3 pages · 13 figures · 3 tables
Abstract

We present ε-retrain, an exploration strategy encouraging a behavioral preference while optimizing policies with monotonic improvement guarantees. To this end, we introduce an iterative procedure for collecting retrain areas -- parts of the state space where an agent did not satisfy the behavioral preference. Our method switches between the typical uniform restart state distribution and the retrain areas using a decaying factor ε, allowing agents to retrain on situations where they violated the preference. We also employ formal verification of neural networks to provably quantify the degree to which agents adhere to these behavioral preferences. Experiments over hundreds of seeds across locomotion, power network, and navigation tasks show that our method yields agents that exhibit significant performance and sample efficiency improvements.
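The restart-switching mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `retrain_areas` container, the `sample()` interface for an area, and the linear decay schedule for ε are all assumptions made for the example.

```python
import random

def choose_restart_state(retrain_areas, uniform_sampler, epsilon):
    """With probability epsilon, restart inside a stored retrain area
    (a region where the behavioral preference was violated);
    otherwise restart from the usual uniform distribution.
    The retrain-area interface here is hypothetical."""
    if retrain_areas and random.random() < epsilon:
        area = random.choice(retrain_areas)  # pick one violating region
        return area.sample()                 # sample a state inside it
    return uniform_sampler()

class LinearDecay:
    """Assumed schedule: decay epsilon linearly from `start` to `end`
    over `steps` calls, then hold at `end`."""
    def __init__(self, start=1.0, end=0.05, steps=10_000):
        self.start, self.end, self.steps, self.t = start, end, steps, 0

    def __call__(self):
        eps = max(self.end,
                  self.start + (self.end - self.start) * self.t / self.steps)
        self.t += 1
        return eps
```

In use, the training loop would call `choose_restart_state(areas, env_uniform_reset, decay())` at every episode reset, so early training revisits violating regions often while later training reverts toward the standard uniform restart distribution.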
