
Bellman Unbiasedness: Toward Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation

Main: 8 Pages
5 Figures
Bibliography: 3 Pages
2 Tables
Appendix: 14 Pages
Abstract

Distributional reinforcement learning improves performance by capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. Moreover, the intractability arising from the infinite dimensionality of return distributions has been largely overlooked. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting. We first introduce a key notion, Bellman unbiasedness, which is essential for exactly learnable and provably efficient distributional updates in an online manner. Among all types of statistical functionals for representing infinite-dimensional return distributions, our theoretical results demonstrate that only moment functionals can exactly capture the statistical information. Second, we propose a provably efficient algorithm, SF-LSVI, that achieves a tight regret bound of $\tilde{O}(d_E H^{3/2}\sqrt{K})$, where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.

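As an illustration of why moment functionals admit exact distributional updates, the sketch below (not the paper's SF-LSVI algorithm; a hypothetical toy written for this summary) propagates raw return moments through a one-step Bellman backup using the binomial identity $\mathbb{E}[(r+Z')^k]=\sum_{j=0}^{k}\binom{k}{j}r^{k-j}\mathbb{E}[Z'^{j}]$, so the backed-up moments depend only on the next-state moments, with no bias from finite-dimensional truncation.

```python
# Toy sketch: propagating return *moments* through a one-step distributional
# Bellman backup. This is illustrative only, not the paper's SF-LSVI.
from math import comb
import numpy as np

def backup_moments(r, next_moments):
    """Raw moments of r + Z', given raw moments [1, E[Z'], E[Z'^2], ...] of Z'."""
    K = len(next_moments) - 1
    return np.array([
        sum(comb(k, j) * r ** (k - j) * next_moments[j] for j in range(k + 1))
        for k in range(K + 1)
    ])

def bellman_moment_update(rewards, probs, next_moments_per_successor):
    """Mix the shifted moments over the transition distribution P(s' | s, a)."""
    mixed = np.zeros_like(next_moments_per_successor[0], dtype=float)
    for r, p, m in zip(rewards, probs, next_moments_per_successor):
        mixed += p * backup_moments(r, m)
    return mixed

# Toy check: deterministic reward 1.0; next-state return is 0 or 2 w.p. 1/2 each,
# so its raw moments E[Z'^k] for k = 0..3 are [1, 1, 2, 4].
next_m = np.array([1.0, 1.0, 2.0, 4.0])
print(bellman_moment_update([1.0], [1.0], [next_m]))  # -> [1. 2. 5. 14.]
```

The printed values match the exact moments of $1 + Z'$, which takes values 1 and 3 with equal probability; functionals such as quantiles or CVaR do not close under this kind of backup, which is the intuition behind restricting to moment functionals.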