Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

7 June 2022
Tyler Sam, Yudong Chen, C. Yu
Abstract

The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}(|S||A|H^3/\epsilon^2)$ over worst-case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in $|S|$ and $|A|$ due to the low-rank structure, we show that without imposing further assumptions beyond low rank of $Q^*$, if one is constrained to estimate the $Q$ function using only observations from a subset of entries, there is a worst-case instance in which one must incur a sample complexity exponential in the horizon $H$ to learn a near-optimal policy. We subsequently show that under stronger low-rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}((|S|+|A|)\,\mathrm{poly}(d,H)/\epsilon^2)$ for a rank-$d$ setting, which is minimax optimal with respect to the scaling of $|S|$, $|A|$, and $\epsilon$. In contrast to literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
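The core idea the abstract relies on is treating the $Q$ function as a $|S| \times |A|$ matrix of rank $d$ and recovering it from noisy estimates of only a subset of its entries. Below is a minimal illustrative sketch of that matrix-estimation step only, not the paper's LR-MCPI or LR-EVI procedures; the dimensions, noise level, observation probability, and the zero-fill-and-rescale truncated-SVD estimator are all hypothetical choices made for the demonstration.

```python
# Illustrative sketch: recover a rank-d Q matrix over |S| x |A| from noisy
# observations of a random subset of entries, then act greedily.
# (Synthetic example; not the algorithm from the paper.)
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 50, 20, 3

# Synthetic rank-d stand-in for an optimal Q-function Q*.
U = rng.normal(size=(n_states, d))
V = rng.normal(size=(n_actions, d))
Q_star = U @ V.T

# Observe each entry independently with probability p_obs, with additive noise
# (playing the role of Monte Carlo estimates from a generative model).
p_obs = 0.3
mask = rng.random((n_states, n_actions)) < p_obs
Q_obs = np.where(mask, Q_star + 0.1 * rng.normal(size=Q_star.shape), 0.0)

# Rescale the zero-filled matrix so its expectation matches Q*,
# then project onto the best rank-d approximation via truncated SVD.
u, s, vt = np.linalg.svd(Q_obs / p_obs, full_matrices=False)
Q_hat = (u[:, :d] * s[:d]) @ vt[:d, :]

# Greedy policy induced by the low-rank estimate.
policy = Q_hat.argmax(axis=1)
print("relative estimation error:",
      np.linalg.norm(Q_hat - Q_star) / np.linalg.norm(Q_star))
```

When the entrywise estimate $\hat{Q}$ is close to $Q^*$, the greedy policy it induces is near-optimal; the paper's algorithms presumably interleave a low-rank estimation step of this flavor with Monte Carlo policy iteration or empirical value iteration across the horizon, which is where the stronger structural assumptions and the $\mathrm{poly}(d,H)$ dependence come in.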

View on arXiv: 2206.03569