
Variance-reduced Q-learning is minimax optimal

Abstract

We introduce and analyze a form of variance-reduced $Q$-learning. For $\gamma$-discounted MDPs with finite state space $\mathcal{X}$ and action space $\mathcal{U}$, we prove that it yields an $\epsilon$-accurate estimate of the optimal $Q$-function in the $\ell_\infty$-norm using $\mathcal{O}\left(\left(\frac{D}{\epsilon^2 (1-\gamma)^3}\right) \log\left(\frac{D}{1-\gamma}\right)\right)$ samples, where $D = |\mathcal{X}| \times |\mathcal{U}|$. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity. In contrast, our past work shows that ordinary $Q$-learning has worst-case quartic scaling in the discount complexity.
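
The abstract does not spell out the update rule, so the following is only a minimal sketch of the recentering idea behind variance-reduced Q-learning in a synchronous generative-model setting. The function `sample_next_state`, the epoch lengths, the step-size schedule, and all sample counts below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def variance_reduced_q_learning(sample_next_state, r, gamma,
                                num_epochs=5, recenter_samples=1000,
                                inner_steps=1000):
    """Sketch of variance-reduced Q-learning with a generative model.

    sample_next_state(x, u, size): draws `size` next states from P(. | x, u).
    r: reward array of shape (n_states, n_actions); gamma: discount factor.
    """
    n_x, n_u = r.shape
    Q_bar = np.zeros((n_x, n_u))          # reference Q-function for recentering

    for _ in range(num_epochs):
        # Monte Carlo estimate of the Bellman operator applied to Q_bar.
        T_bar = np.zeros_like(Q_bar)
        for x in range(n_x):
            for u in range(n_u):
                xs = sample_next_state(x, u, recenter_samples)
                T_bar[x, u] = r[x, u] + gamma * np.mean(Q_bar[xs].max(axis=1))

        # Recentered stochastic Q-learning updates within the epoch.
        Q = Q_bar.copy()
        for k in range(1, inner_steps + 1):
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)   # rescaled linear step size (assumed)
            for x in range(n_x):
                for u in range(n_u):
                    x_next = sample_next_state(x, u, 1)[0]
                    # Empirical Bellman backup, recentered around Q_bar to cut variance.
                    delta = gamma * (Q[x_next].max() - Q_bar[x_next].max())
                    Q[x, u] = (1 - lam) * Q[x, u] + lam * (T_bar[x, u] + delta)

        Q_bar = Q                          # next epoch recenters at the new iterate
    return Q_bar

# Example usage on a tiny random MDP (illustrative only).
if __name__ == "__main__":
    n_x, n_u = 5, 2
    P = np.random.dirichlet(np.ones(n_x), size=(n_x, n_u))   # transition kernel
    r = np.random.rand(n_x, n_u)
    sample = lambda x, u, size: np.random.choice(n_x, size=size, p=P[x, u])
    Q_hat = variance_reduced_q_learning(sample, r, gamma=0.9)
    print(Q_hat)
```

The recentering term subtracts the current Bellman backup of the fixed reference $\bar{Q}$ and adds back its Monte Carlo estimate, so the per-step noise scales with $Q - \bar{Q}$ rather than with $Q$ itself; this is the mechanism that improves the worst-case discount dependence from quartic to cubic.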
