Self-Supervised Online Reward Shaping in Sparse-Reward Environments

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021
Abstract

We propose a novel reinforcement learning framework that performs self-supervised online reward shaping, yielding faster, more sample-efficient learning in sparse-reward environments. The proposed framework alternates between updating a policy and inferring a reward function. While the policy update is performed with the inferred, potentially dense reward function, the original sparse reward provides a self-supervisory signal for the reward update by serving as an ordering over the observed trajectories. The framework builds on the result that altering the reward function does not affect the optimal policy of the original MDP as long as certain relations between the altered and the original reward are maintained. We name the proposed framework ClAssification-based Reward Shaping (CaReS), since the altered reward is learned in a self-supervised manner using classifier-based reward inference. Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is not only significantly more sample-efficient than a state-of-the-art reinforcement learning baseline but also achieves sample efficiency similar to that of a baseline with hand-designed dense reward functions.
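
The abstract does not specify CaReS's reward model or classification objective, so the following is only a minimal sketch of the alternation it describes. It assumes a linear reward over state features and a Bradley-Terry-style pairwise logistic loss in which the sparse return supplies the ranking over trajectories; all names (`reward_update`, `policy_update`), dimensions, and hyperparameters are hypothetical, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def shaped_return(theta, traj):
    # Cumulative inferred (dense) reward along a trajectory of state features.
    return float(sum(theta @ phi for phi in traj))

def reward_update(theta, trajs, sparse_rets, lr=0.05):
    # Classifier-style update: for every pair of trajectories ranked by the
    # sparse return, push the shaped return of the better trajectory above
    # that of the worse one (a pairwise logistic loss, assumed here).
    for i, g_i in enumerate(sparse_rets):
        for j, g_j in enumerate(sparse_rets):
            if g_i <= g_j:
                continue                      # only pairs where i outranks j
            diff = shaped_return(theta, trajs[i]) - shaped_return(theta, trajs[j])
            p = 1.0 / (1.0 + np.exp(-diff))   # model's P(traj i preferred)
            grad = (1.0 - p) * (trajs[i].sum(axis=0) - trajs[j].sum(axis=0))
            theta = theta + lr * grad         # gradient ascent on log-likelihood
    return theta

def policy_update(policy, trajs, theta):
    # Placeholder: any RL step (e.g., a policy-gradient update) driven by
    # the inferred dense reward r_theta would go here.
    return policy

# Alternating loop, as described in the abstract.
dim = 4
theta = np.zeros(dim)                          # linear reward parameters
policy = None                                  # stand-in for a real policy
for _ in range(10):
    # Stand-in rollouts: 8 trajectories of 20 random feature vectors each.
    trajs = [rng.normal(size=(20, dim)) for _ in range(8)]
    sparse_rets = [int(rng.integers(0, 2)) for _ in trajs]  # sparse 0/1 returns
    theta = reward_update(theta, trajs, sparse_rets)
    policy = policy_update(policy, trajs, theta)
```

In a full implementation, the placeholder `policy_update` would be an off-the-shelf RL update run against the inferred dense reward, while the sparse reward is consumed only through the ranking used in `reward_update`.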
