Distributional reinforcement learning improves performance by capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. Moreover, the intractability arising from the infinite dimensionality of return distributions has been largely overlooked. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting. We first introduce a key notion, Bellman unbiasedness, which is essential for exactly learnable and provably efficient distributional updates in an online manner. Among all types of statistical functionals for representing infinite-dimensional return distributions, our theoretical results show that only moment functionals can exactly capture the statistical information. Second, we propose a provably efficient algorithm, SF-LSVI, that achieves a tight regret bound of $\widetilde{\mathcal{O}}(d_E H^{\frac{3}{2}}\sqrt{K})$, where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.
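As a minimal illustration of the abstract's claim about moment functionals (a sketch of the underlying idea, not the paper's formal statement), the display below assumes a deterministic reward $r(s,a)$, a sampled next state $s' \sim P(\cdot \mid s,a)$, and an episodic return $Z_{h+1}(s')$ at the next step:

% Sketch: why the m-th raw moment admits an unbiased one-sample Bellman update.
% The one-step return is Z_h(s,a) = r(s,a) + Z_{h+1}(s'); expand its m-th moment
% with the tower property and the binomial theorem.
\begin{align*}
\mathbb{E}\big[Z_h(s,a)^m\big]
  &= \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\Big[\,\mathbb{E}\big[(r(s,a) + Z_{h+1}(s'))^m \,\big|\, s'\big]\Big] \\
  &= \mathbb{E}_{s'}\Big[\textstyle\sum_{j=0}^{m} \binom{m}{j}\, r(s,a)^{m-j}\,
       \mathbb{E}\big[Z_{h+1}(s')^{j}\big]\Big].
\end{align*}
% The m-th moment at step h is an affine combination of the first m moments at step h+1,
% so replacing the outer expectation by a single sampled transition s' gives an unbiased estimate.

In contrast, functionals that are not affine in the next-state return distribution (e.g., quantiles or CVaR) admit no analogous identity, so a one-sample plug-in update for them is biased in general; this is the intuition behind restricting attention to moment functionals.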
@article{cho2025_2407.21260,
  title   = {Bellman Unbiasedness: Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation},
  author  = {Taehyun Cho and Seungyub Han and Kyungjae Lee and Seokhun Ju and Dohyeong Kim and Jungwoo Lee},
  journal = {arXiv preprint arXiv:2407.21260},
  year    = {2025}
}