Adaptive Discretization-based Non-Episodic Reinforcement Learning in Metric Spaces

Main: 7 pages, 3 figures; bibliography: 2 pages; appendix: 29 pages
Abstract

We study non-episodic reinforcement learning for Lipschitz MDPs, in which the state-action space is a metric space and the transition kernel and rewards are Lipschitz functions. We develop a computationally efficient UCB-based algorithm, $\textit{ZoRL-}\epsilon$, that adaptively discretizes the state-action space, and show that its regret with respect to an $\epsilon$-optimal policy is bounded as $\mathcal{O}(\epsilon^{-(2 d_\mathcal{S} + d^\epsilon_z + 1)}\log{(T)})$, where $d^\epsilon_z$ is the $\epsilon$-zooming dimension. In contrast, running vanilla $\textit{UCRL-}2$ on a fixed discretization of the MDP yields regret with respect to an $\epsilon$-optimal policy that scales as $\mathcal{O}(\epsilon^{-(2 d_\mathcal{S} + d + 1)}\log{(T)})$, where $d$ is the dimension of the state-action space, so the gains from adaptivity are large when $d^\epsilon_z \ll d$. Note that the absolute regret of any 'uniformly good' algorithm for a large family of continuous MDPs asymptotically scales at least as $\Omega(\log{(T)})$. Although adaptive discretization has been shown to yield $\mathcal{\tilde{O}}(H^{2.5}K^{\frac{d_z + 1}{d_z + 2}})$ regret in episodic RL, extending this to the non-episodic case by employing episodes of a fixed duration that grows with $T$ is futile, since $d_z \to d$ as $T \to \infty$. The current work shows how to obtain adaptivity gains for non-episodic RL. The theoretical results are supported by simulations on two systems, in which the performance of $\textit{ZoRL-}\epsilon$ is compared with that of $\textit{UCRL-C}$, the fixed-discretization-based extension of $\textit{UCRL-}2$ for systems with continuous state-action spaces.
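To make the adaptive-discretization idea concrete, here is a minimal, hypothetical Python sketch of the cell-splitting mechanism that zooming-style algorithms such as $\textit{ZoRL-}\epsilon$ build on: a cell of the state-action space is refined once it has been visited often enough that the statistical error of its estimates falls below its discretization error. All names and the specific splitting threshold below are illustrative assumptions; the paper's actual algorithm additionally maintains UCB estimates of rewards and transition kernels over the partition.

```python
# Illustrative sketch (not the paper's algorithm): adaptive discretization
# of a 2-D state-action space [0,1]^2 under the sup-metric.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Cell:
    center: Tuple[float, float]   # (state, action) midpoint of the cell
    radius: float                 # metric radius of the cell
    visits: int = 0

    def should_split(self) -> bool:
        # Hypothetical rule: split once statistical error ~ 1/sqrt(visits)
        # drops below discretization error ~ radius, i.e. visits >= radius^-2.
        return self.visits >= max(1, int(self.radius ** -2))

    def split(self) -> List["Cell"]:
        # Cover the cell with four half-radius children (dimension d = 2).
        s, a = self.center
        r = self.radius / 2
        return [Cell((s + ds, a + da), r)
                for ds in (-r, r) for da in (-r, r)]

def select_and_update(cells: List[Cell], s: float, a: float) -> None:
    # Find the active cell containing (s, a), record the visit,
    # and refine the partition where the data warrants it.
    for i, c in enumerate(cells):
        if max(abs(s - c.center[0]), abs(a - c.center[1])) <= c.radius:
            c.visits += 1
            if c.should_split():
                cells[i:i + 1] = c.split()   # replace parent by children
            return

# Usage: start from one cell covering [0,1]^2 and feed it visited pairs.
cells = [Cell((0.5, 0.5), 0.5)]
select_and_update(cells, 0.3, 0.7)
```

Because cells shrink only where the trajectory actually concentrates, the effective number of cells is governed by the zooming dimension $d^\epsilon_z$ rather than the ambient dimension $d$, which is the source of the regret improvement stated above.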
