Gap-Dependent Unsupervised Exploration for Reinforcement Learning

Abstract

In task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals; it is then given a reward and asked to compute a corresponding near-optimal policy. Existing approaches mainly address worst-case scenarios, in which no structural information about the reward or transition dynamics is used. As a result, the best sample upper bound is $\propto\widetilde{\mathcal{O}}(1/\epsilon^2)$, where $\epsilon>0$ is the target accuracy of the obtained policy, and this bound can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that uses a gap parameter, $\rho>0$, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}}(1/\epsilon \cdot (H^3SA/\rho + H^4S^2A))$ episodes of exploration and is able to obtain an $\epsilon$-optimal policy for a post-revealed reward with sub-optimality gap at least $\rho$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon. This yields a nearly \emph{quadratic saving} in terms of $\epsilon$. We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$. We further show that a $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.
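To make the claimed saving in $\epsilon$ concrete, here is a minimal Python sketch that evaluates the two rates stated in the abstract. It is an illustration only: the function names are hypothetical, all constants and logarithmic factors hidden by $\widetilde{\mathcal{O}}$ are dropped, and the worst-case rate is modeled by its $\epsilon$ dependence alone, since the abstract states only that dependence. The sample values of $H$, $S$, $A$, and $\rho$ are arbitrary and are not taken from the paper.

```python
def gap_dependent_episodes(eps, rho, H, S, A):
    """Evaluate (1/eps) * (H^3 S A / rho + H^4 S^2 A).

    This is the episode bound from the abstract with constants and
    log factors dropped (an assumption made for illustration).
    """
    return (H**3 * S * A / rho + H**4 * S**2 * A) / eps


def worst_case_rate(eps):
    """Evaluate 1/eps^2, the eps dependence of the worst-case bound.

    The dependence on H, S, A is omitted because the abstract does not state it.
    """
    return 1.0 / eps**2


if __name__ == "__main__":
    # Arbitrary illustrative problem sizes (not from the paper).
    H, S, A, rho = 4, 8, 2, 2**-3
    for eps in (2**-4, 2**-5, 2**-6):
        gap = gap_dependent_episodes(eps, rho, H, S, A)
        worst = worst_case_rate(eps)
        print(f"eps={eps:.5f}  gap-dependent={gap:.0f}  worst-case rate={worst:.0f}")
```

With $\rho$ and the problem sizes held fixed, halving $\epsilon$ doubles the gap-dependent expression but quadruples the worst-case rate, which is the sense in which the saving is nearly quadratic.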
