Highway Reinforcement Learning

Abstract

Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as n-step Q-learning, look ahead for n time steps along the trajectory of actions (where n is called the lookahead depth) and utilize off-policy data directly without any additional adjustment. They work well for proper choices of n. We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large n, restricting their capacity to efficiently utilize information from distant future time steps. To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF. At its core lies a simple but non-trivial highway gate, which controls the information flow from the distant future by comparing it to a threshold. The highway gate guarantees convergence to the optimal VF for arbitrary n and arbitrary behavioral policies. It gives rise to a novel family of off-policy RL algorithms that safely learn even when n is very large, facilitating rapid credit assignment from the far future to the past. On tasks with greatly delayed rewards, including video games where the reward is given only at the end of the game, our new methods outperform many existing multi-step off-policy algorithms.
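
To make the underestimation issue concrete, the following minimal Python sketch computes the standard uncorrected n-step Q-learning target and contrasts it with a simple gated variant. The function names, the toy numbers, and in particular the gate itself are illustrative assumptions: the abstract only states that the highway gate compares information from the distant future to a threshold, so the `max` against a one-step bootstrapped target used here is one plausible reading, not necessarily the paper's exact operator.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Uncorrected n-step target: discounted rewards observed along the
    behavior trajectory plus a discounted bootstrap from the state reached
    after n = len(rewards) steps."""
    n = len(rewards)
    discounted_rewards = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted_rewards + (gamma ** n) * bootstrap_value


def gated_target(
    rewards: Sequence[float],
    one_step_bootstrap: float,
    n_step_bootstrap: float,
    gamma: float,
) -> float:
    """Hypothetical gate (an assumption, not the paper's definition):
    let the n-step lookahead through only if it is at least as large as
    the one-step bootstrapped target, which serves as the threshold."""
    threshold = n_step_target(rewards[:1], one_step_bootstrap, gamma)
    lookahead = n_step_target(rewards, n_step_bootstrap, gamma)
    return max(threshold, lookahead)


# Toy case: the behavior policy acts poorly for three steps (zero reward and
# a worthless final state), while the current value estimate of the best
# action at the next state is 1.0.
gamma = 0.9
rewards = [0.0, 0.0, 0.0]
plain = n_step_target(rewards, bootstrap_value=0.0, gamma=gamma)
gated = gated_target(rewards, one_step_bootstrap=1.0, n_step_bootstrap=0.0, gamma=gamma)
```

In this toy case the plain 3-step target inherits the behavior policy's poor choices and sits below the one-step target, whereas the gated variant falls back to the threshold. When the lookahead is the larger of the two, the gated variant passes it through unchanged, so long-range reward information is kept when it is useful.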
