Adaptive Discretization-based Non-Episodic Reinforcement Learning in Metric Spaces

Main: 7 pages, 3 figures; bibliography: 2 pages; appendix: 29 pages
Abstract

We study non-episodic reinforcement learning for Lipschitz MDPs, in which the state-action space is a metric space and the transition kernel and rewards are Lipschitz functions. We develop a computationally efficient UCB-based algorithm, $\textit{ZoRL-}\epsilon$, that adaptively discretizes the state-action space, and show that its regret with respect to an $\epsilon$-optimal policy is bounded as $\mathcal{O}(\epsilon^{-(2 d_\mathcal{S} + d^\epsilon_z + 1)}\log{(T)})$, where $d^\epsilon_z$ is the $\epsilon$-zooming dimension. In contrast, running vanilla $\textit{UCRL-}2$ on a fixed discretization of the MDP yields regret with respect to an $\epsilon$-optimal policy that scales as $\mathcal{O}(\epsilon^{-(2 d_\mathcal{S} + d + 1)}\log{(T)})$, where $d$ is the dimension of the state-action space, so the gains from adaptivity are large when $d^\epsilon_z \ll d$. Note that the absolute regret of any 'uniformly good' algorithm for a large family of continuous MDPs asymptotically scales at least as $\Omega(\log{(T)})$. Although adaptive discretization has been shown to yield $\mathcal{\tilde{O}}(H^{2.5}K^{\frac{d_z + 1}{d_z + 2}})$ regret in episodic RL, extending this to the non-episodic case by employing episodes of a fixed duration that grows with $T$ is futile, since $d_z \to d$ as $T \to \infty$. The current work shows how to obtain adaptivity gains for non-episodic RL. The theoretical results are supported by simulations on two systems, in which the performance of $\textit{ZoRL-}\epsilon$ is compared with that of $\textit{UCRL-C}$, the fixed-discretization-based extension of $\textit{UCRL-}2$ for systems with continuous state-action spaces.
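To make the adaptive-discretization idea concrete, here is a minimal, hypothetical Python sketch of the cell-splitting mechanism that zooming-style algorithms such as $\textit{ZoRL-}\epsilon$ build on: a cell of the state-action space is refined once it has been visited often enough that the statistical error of its estimates falls below its discretization error. All names and the specific splitting threshold below are illustrative assumptions; the paper's actual algorithm additionally maintains UCB estimates of rewards and transition kernels over the partition.

```python
# Illustrative sketch (not the paper's algorithm): adaptive discretization
# of a 2-D state-action space [0,1]^2 under the sup-metric.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Cell:
    center: Tuple[float, float]   # (state, action) midpoint of the cell
    radius: float                 # metric radius of the cell
    visits: int = 0

    def should_split(self) -> bool:
        # Hypothetical rule: split once statistical error ~ 1/sqrt(visits)
        # drops below discretization error ~ radius, i.e. visits >= radius^-2.
        return self.visits >= max(1, int(self.radius ** -2))

    def split(self) -> List["Cell"]:
        # Cover the cell with four half-radius children (dimension d = 2).
        s, a = self.center
        r = self.radius / 2
        return [Cell((s + ds, a + da), r)
                for ds in (-r, r) for da in (-r, r)]

def select_and_update(cells: List[Cell], s: float, a: float) -> None:
    # Find the active cell containing (s, a), record the visit,
    # and refine the partition where the data warrants it.
    for i, c in enumerate(cells):
        if max(abs(s - c.center[0]), abs(a - c.center[1])) <= c.radius:
            c.visits += 1
            if c.should_split():
                cells[i:i + 1] = c.split()   # replace parent by children
            return

# Usage: start from one cell covering [0,1]^2 and feed it visited pairs.
cells = [Cell((0.5, 0.5), 0.5)]
select_and_update(cells, 0.3, 0.7)
```

Because cells shrink only where the trajectory actually concentrates, the effective number of cells is governed by the zooming dimension $d^\epsilon_z$ rather than the ambient dimension $d$, which is the source of the regret improvement stated above.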
