arXiv:2102.06548

Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

12 February 2021
Gen Li
Changxiao Cai
Yuxin Chen
Yuting Wei
Yuejie Chi
    OffRL
Abstract

Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made towards understanding the sample efficiency of Q-learning. Consider a $\gamma$-discounted infinite-horizon MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$: to yield an entrywise $\varepsilon$-approximation of the optimal Q-function, state-of-the-art theory for Q-learning requires a sample size exceeding the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{2}}$, which fails to match existing minimax lower bounds. This gives rise to natural questions: what is the sharp sample complexity of Q-learning? Is Q-learning provably sub-optimal? This paper addresses these questions for the synchronous setting: (1) when $|\mathcal{A}|=1$ (so that Q-learning reduces to TD learning), we prove that the sample complexity of TD learning is minimax optimal and scales as $\frac{|\mathcal{S}|}{(1-\gamma)^3\varepsilon^2}$ (up to log factor); (2) when $|\mathcal{A}|\geq 2$, we settle the sample complexity of Q-learning to be on the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ (up to log factor). Our theory unveils the strict sub-optimality of Q-learning when $|\mathcal{A}|\geq 2$, and rigorizes the negative impact of over-estimation in Q-learning. Finally, we extend our analysis to accommodate asynchronous Q-learning (i.e., the case with Markovian samples), sharpening the horizon dependency of its sample complexity to be $\frac{1}{(1-\gamma)^4}$.
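To make the synchronous setting concrete, the sketch below (not the authors' code) shows tabular synchronous Q-learning with a generative model: in every iteration one fresh next-state sample is drawn for each state-action pair, and every entry of the Q-estimate is updated toward its Bellman target with step size $\eta_t$. The transition tensor `P`, reward table `r`, and the rescaled-linear step-size schedule are illustrative assumptions, not quantities taken from the paper.

```python
# Minimal sketch of synchronous Q-learning with a generative model.
# Assumptions (illustrative, not from the paper's code): a tabular MDP
# given by a transition tensor P[s, a, s'] and reward table r[s, a],
# discount factor gamma, and a rescaled-linear step-size schedule.
import numpy as np

def synchronous_q_learning(P, r, gamma, num_iters, rng=None):
    """Run synchronous Q-learning and return the final Q-estimate.

    P : (S, A, S) transition probabilities of the MDP.
    r : (S, A) deterministic rewards in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for t in range(1, num_iters + 1):
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)  # illustrative step size
        V = Q.max(axis=1)  # greedy value estimate (max over actions)
        # Generative model: one independent next-state sample per (s, a).
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P[s, a])
                target = r[s, a] + gamma * V[s_next]
                Q[s, a] = (1.0 - eta) * Q[s, a] + eta * target
    return Q

# Toy usage: a random 3-state, 2-action MDP.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # (3, 2, 3) transition tensor
r = rng.uniform(size=(3, 2))
Q_hat = synchronous_q_learning(P, r, gamma=0.9, num_iters=5000, rng=rng)
print(Q_hat)
```

The max over actions in the target is the over-estimation mechanism the abstract refers to; when the action space is a singleton, the max disappears and the update reduces to TD learning.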
