
Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation

Main: 8 pages · Appendix: 14 pages · Bibliography: 3 pages · 5 figures · 2 tables
Abstract

Distributional reinforcement learning improves performance by effectively capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in the finite episodic Markov decision process setting. We first introduce the key notion of Bellman unbiasedness, which enables a tractable and exactly learnable update via statistical functional dynamic programming. Our theoretical results show that approximating the infinite-dimensional return distribution with a finite number of moment functionals is the only way to learn statistical information, including nonlinear statistical functionals, without bias. Second, we propose a provably efficient algorithm, $\texttt{SF-LSVI}$, which achieves a regret bound of $\tilde{O}(d_E H^{3/2}\sqrt{K})$, where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of the function class.
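As a rough illustration of the moment-functional idea (not the paper's $\texttt{SF-LSVI}$ algorithm), the Python sketch below shows why moments admit an exact, unbiased one-step distributional Bellman backup: the $k$-th moment of the return $r + \gamma G'$ is a closed-form function of the moments of $G'$ via the binomial expansion. The helper name `moment_backup` and the toy distribution are hypothetical, chosen only for the demonstration.

```python
# Minimal sketch of statistical functional dynamic programming on moments.
# The k-th moment of r + gamma * G' follows from the binomial expansion:
#   E[(r + g*G')^k] = sum_j C(k, j) * r^j * g^(k-j) * E[G'^(k-j)].
# `moment_backup` is a hypothetical helper, not an API from the paper.
import numpy as np
from math import comb

def moment_backup(r, gamma, next_moments):
    """Exact moments of r + gamma * G' given the moments of G'.

    next_moments[k] = E[G'^k], with next_moments[0] == 1.
    """
    m = len(next_moments) - 1
    out = np.zeros(m + 1)
    for k in range(m + 1):
        out[k] = sum(
            comb(k, j) * (r ** j) * (gamma ** (k - j)) * next_moments[k - j]
            for j in range(k + 1)
        )
    return out

# Toy check: G' is +1 or -1 with probability 1/2, reward r = 0.5, gamma = 0.9.
rng = np.random.default_rng(0)
gp = rng.choice([1.0, -1.0], size=200_000)                 # samples of G'
next_moments = np.array([np.mean(gp ** k) for k in range(4)])
backed = moment_backup(0.5, 0.9, next_moments)             # exact moment DP
mc = np.array([np.mean((0.5 + 0.9 * gp) ** k) for k in range(4)])
print(backed)  # agrees with mc up to Monte-Carlo noise:
print(mc)      # moments propagate through the backup without bias
```

In contrast, a nonlinear functional such as a quantile has no analogous closed-form backup from a finite summary of the next-state distribution, which is the intuition behind the paper's claim that finite collections of moment functionals are the only statistics learnable unbiasedly this way.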
