
Improving Policy Optimization via ε-Retrain

Main: 13 pages · Appendix: 4 pages · Bibliography: 3 pages · 13 figures · 3 tables
Abstract

We present ε-retrain, an exploration strategy encouraging a behavioral preference while optimizing policies with monotonic improvement guarantees. To this end, we introduce an iterative procedure for collecting retrain areas -- parts of the state space where an agent did not satisfy the behavioral preference. Our method switches between the typical uniform restart state distribution and the retrain areas using a decaying factor ε, allowing agents to retrain on situations where they violated the preference. We also employ formal verification of neural networks to provably quantify the degree to which agents adhere to these behavioral preferences. Experiments over hundreds of seeds across locomotion, power network, and navigation tasks show that our method yields agents that exhibit significant performance and sample efficiency improvements.
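The restart-switching mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `retrain_areas` container, the `sample()` interface for an area, and the linear decay schedule for ε are all assumptions made for the example.

```python
import random

def choose_restart_state(retrain_areas, uniform_sampler, epsilon):
    """With probability epsilon, restart inside a stored retrain area
    (a region where the behavioral preference was violated);
    otherwise restart from the usual uniform distribution.
    The retrain-area interface here is hypothetical."""
    if retrain_areas and random.random() < epsilon:
        area = random.choice(retrain_areas)  # pick one violating region
        return area.sample()                 # sample a state inside it
    return uniform_sampler()

class LinearDecay:
    """Assumed schedule: decay epsilon linearly from `start` to `end`
    over `steps` calls, then hold at `end`."""
    def __init__(self, start=1.0, end=0.05, steps=10_000):
        self.start, self.end, self.steps, self.t = start, end, steps, 0

    def __call__(self):
        eps = max(self.end,
                  self.start + (self.end - self.start) * self.t / self.steps)
        self.t += 1
        return eps
```

In use, the training loop would call `choose_restart_state(areas, env_uniform_reset, decay())` at every episode reset, so early training revisits violating regions often while later training reverts toward the standard uniform restart distribution.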
