Adaptive Discretization-based Non-Episodic Reinforcement Learning in Metric Spaces
We study non-episodic reinforcement learning for Lipschitz MDPs, in which the state-action space is a metric space and the transition kernel and rewards are Lipschitz functions. We develop a computationally efficient UCB-based algorithm, ZoRL-$\epsilon$, that adaptively discretizes the state-action space, and show that its regret with respect to an $\epsilon$-optimal policy is bounded as $\mathcal{O}\left(\epsilon^{-\left(2 d_{\mathcal{S}} + d^{\epsilon}_{z} + 1\right)} \log T\right)$, where $d_{\mathcal{S}}$ is the dimension of the state space and $d^{\epsilon}_{z}$ is the $\epsilon$-zooming dimension. In contrast, if one runs vanilla UCRL-2 on a fixed discretization of the MDP, the regret with respect to an $\epsilon$-optimal policy scales as $\mathcal{O}\left(\epsilon^{-\left(2 d_{\mathcal{S}} + d + 1\right)} \log T\right)$, where $d$ is the dimension of the state-action space, so that the adaptivity gains are huge when $d^{\epsilon}_{z} \ll d$. Note that the absolute regret of any 'uniformly good' algorithm for a large family of continuous MDPs asymptotically scales at least as $\Omega(\log T)$. Though adaptive discretization has been shown to yield $\tilde{\mathcal{O}}\left(H^{5/2} K^{\frac{d_z + 1}{d_z + 2}}\right)$ regret in episodic RL, with $H$ the horizon and $K$ the number of episodes, attempting to extend this result to the non-episodic case by employing fixed-duration episodes whose duration grows with $T$ is futile, since the associated zooming dimension $d_z \to d$ as $T \to \infty$. The current work shows how to obtain adaptivity gains for non-episodic RL. The theoretical results are supported by simulations on two systems, where the performance of ZoRL-$\epsilon$ is compared with that of 'UCRL-C,' the fixed discretization-based extension of UCRL-2 to systems with continuous state-action spaces.
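The adaptive-discretization ("zooming") idea is easiest to see outside the full MDP setting. Below is a minimal, self-contained sketch on a toy one-dimensional continuum-armed bandit. It is not the paper's ZoRL-$\epsilon$ (there are no transitions or policies here), and the reward function, confidence width, and splitting threshold are all illustrative assumptions; it only shows how a UCB rule concentrates refinement near the optimum, so that the number of active cells is governed by a zooming dimension rather than the ambient dimension.

```python
import math
import random

# Illustrative sketch of the zooming/adaptive-discretization idea on a
# 1-D continuum-armed bandit. NOT the paper's ZoRL-eps algorithm: no MDP
# dynamics, no transition kernel -- just a toy showing how promising
# regions get refined while the rest of the space stays coarse.

class Cell:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi      # interval covered by this cell
        self.n = 0                     # number of plays of this cell
        self.mean = 0.0                # empirical mean reward

    @property
    def radius(self):
        return (self.hi - self.lo) / 2

    def ucb(self, t):
        # optimism: empirical mean + confidence width + discretization bias
        if self.n == 0:
            return float("inf")
        return self.mean + math.sqrt(2 * math.log(t) / self.n) + self.radius

def reward(x):
    # unknown Lipschitz reward; peak at x = 0.7 (assumed for the demo)
    return max(0.0, 1 - 4 * abs(x - 0.7)) + random.gauss(0, 0.1)

def run(T=5000, seed=0):
    random.seed(seed)
    cells = [Cell(0.0, 1.0)]
    for t in range(1, T + 1):
        c = max(cells, key=lambda c: c.ucb(t))   # most optimistic cell
        x = random.uniform(c.lo, c.hi)           # play a point inside it
        r = reward(x)
        c.n += 1
        c.mean += (r - c.mean) / c.n
        # splitting rule (an assumption for this sketch): refine a cell
        # once its statistical error is comparable to its discretization
        # error, i.e. n >= radius^{-2}
        if c.n >= c.radius ** -2:
            mid = (c.lo + c.hi) / 2
            cells.remove(c)
            cells += [Cell(c.lo, mid), Cell(mid, c.hi)]
    return cells

cells = run()
print(f"{len(cells)} cells; finest cells cluster near the optimum:")
for c in sorted(cells, key=lambda c: c.radius)[:5]:
    print(f"  [{c.lo:.3f}, {c.hi:.3f}]  n={c.n}")
```

Running the sketch shows the smallest cells clustering around the maximizer at 0.7, while far-from-optimal regions remain coarsely discretized; the non-episodic MDP setting of the paper additionally has to handle state dynamics, which is what ZoRL-$\epsilon$ addresses.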