Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning

Operational Research (OR), 2021
Abstract

Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made recently towards understanding the sample efficiency of Q-learning. To yield an entrywise $\varepsilon$-accurate estimate of the optimal Q-function, state-of-the-art theory requires at least an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{2}}$ samples for a $\gamma$-discounted infinite-horizon MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$. In this work, we sharpen the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ (up to some logarithmic factor) for any $0<\varepsilon<1$, leading to an order-wise improvement in terms of the effective horizon $\frac{1}{1-\gamma}$. Analogous results are derived for finite-horizon MDPs as well. Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage. A key ingredient of our analysis lies in the establishment of novel error decompositions and recursions, which might shed light on how to analyze finite-sample performance of other Q-learning variants.
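To make the setting concrete, here is a minimal sketch of synchronous Q-learning on a toy MDP. The two-state, two-action MDP, the rescaled-linear stepsize, and the iteration count are all illustrative assumptions for this sketch, not details taken from the paper; the key feature of the synchronous setting is that one fresh sample is drawn from the generative model for every state-action pair in each iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 2, 2, 0.9                    # toy sizes and discount (assumed)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])  # P[s, a, s']: transition kernel
r = np.array([[1.0, 0.0], [0.0, 1.0]])    # r[s, a]: deterministic rewards

Q = np.zeros((S, A))
T = 5000
for t in range(1, T + 1):
    eta = 1.0 / (1 + (1 - gamma) * t)     # one common stepsize choice (assumed)
    for s in range(S):
        for a in range(A):
            s_next = rng.choice(S, p=P[s, a])           # one fresh sample per (s, a)
            target = r[s, a] + gamma * Q[s_next].max()  # empirical Bellman backup
            Q[s, a] = (1 - eta) * Q[s, a] + eta * target

# Ground truth Q* via value iteration on the known model, for comparison
Q_star = np.zeros((S, A))
for _ in range(2000):
    Q_star = r + gamma * P @ Q_star.max(axis=1)

print(np.abs(Q - Q_star).max())  # entrywise error; small for this toy MDP
```

The quantity printed at the end is exactly the entrywise error $\|Q - Q^\star\|_\infty$ that the sample-complexity bounds control: the theory asks how many iterations $T$ (hence samples $T|\mathcal{S}||\mathcal{A}|$) are needed to drive it below a target $\varepsilon$.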
