KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal
Tadashi Kozuno
Wenhao Yang
Nino Vieillard
Toshinori Kitamura
Yunhao Tang
Jincheng Mei
Pierre Ménard
M. G. Azar
Michal Valko
Rémi Munos
Olivier Pietquin
M. Geist
Csaba Szepesvári

Abstract
In this work, we consider and analyze the sample complexity of model-free reinforcement learning with a generative model. In particular, we analyze mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates. Our analysis shows that it is nearly minimax-optimal for finding an ε-optimal policy when ε is sufficiently small. This is the first theoretical result demonstrating that a simple model-free algorithm without variance reduction can be nearly minimax-optimal under the considered setting.
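To make the KL- and entropy-regularized update concrete, the sketch below shows one common closed form for a greedy step regularized by a KL term toward the previous policy plus an entropy bonus, of the kind the abstract describes. This is a minimal illustrative example, not the paper's exact MDVI recursion; the parameter names kl_weight and ent_weight and the tabular setup are assumptions made here for illustration.

```python
import numpy as np

def regularized_policy_update(q, prev_policy, kl_weight, ent_weight):
    """One KL + entropy regularized greedy step (illustrative sketch).

    Solves, row-wise over states,
        argmax_pi  <pi, q> - kl_weight * KL(pi || prev_policy) + ent_weight * H(pi),
    whose maximizer is a softmax of (q + kl_weight * log prev_policy)
    with temperature kl_weight + ent_weight.  Parameter names are
    illustrative, not the paper's notation.
    """
    beta = kl_weight + ent_weight                      # total regularization strength
    logits = (kl_weight * np.log(prev_policy) + q) / beta
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    policy = np.exp(logits)
    return policy / policy.sum(axis=1, keepdims=True)

# Tiny usage example on a random 3-state, 2-action problem.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 2))                            # tabular q-values
pi0 = np.full((3, 2), 0.5)                             # uniform initial policy
pi1 = regularized_policy_update(q, pi0, kl_weight=1.0, ent_weight=0.1)
```

The KL term keeps successive policies close to each other (implicitly averaging past value estimates), while the entropy term keeps the policy stochastic; both effects are central to the analysis summarized in the abstract.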