ISL: Optimal Policy Learning With Optimal Exploration-Exploitation Trade-Off
Maximum entropy reinforcement learning (RL) has received considerable attention recently. Some of the algorithms within this framework achieve state-of-the-art performance on many challenging tasks. Although these algorithms exhibit improved exploration, they remain inefficient at performing deep exploration. The contribution of this paper is the introduction of a new kind of soft RL algorithm (referred to as the ISL strategy) that performs deep exploration efficiently. As in maximum entropy RL, we achieve this objective by augmenting the traditional RL objective with a novel regularization term. A distinctive feature of our approach is that, unlike other works that tackle deep exploration, both the learning equations and the exploration-exploitation strategy are derived in tandem as the solution to a well-posed optimization problem whose minimization leads to the optimal value function. Empirically, we show that our method achieves state-of-the-art performance on a range of challenging deep-exploration benchmarks.
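For reference, the standard maximum entropy RL objective mentioned above augments the expected discounted return with a policy-entropy bonus weighted by a temperature parameter $\alpha$. This is the generic maximum entropy objective, not the ISL regularizer itself, whose exact form the abstract does not specify:

\[
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t, a_t) \;+\; \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) \;=\; -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big].
\]

The ISL strategy replaces the entropy term $\alpha\,\mathcal{H}(\pi(\cdot \mid s_t))$ with its own regularization term, chosen so that optimizing the augmented objective yields both the learning equations and the exploration-exploitation strategy jointly.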