Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

13 May 2021
Ming Yin, Yu-Xiang Wang
Abstract

This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. Uniform OPE, $\sup_{\Pi}|Q^\pi - \hat{Q}^\pi| < \epsilon$, is a stronger measure than point-wise OPE and ensures offline learning when $\Pi$ contains all policies (the global class). In this paper, we establish an $\Omega(H^2 S / d_m\epsilon^2)$ lower bound (over the model-based family) for global uniform OPE, and our main result establishes an upper bound of $\tilde{O}(H^2 / d_m\epsilon^2)$ for \emph{local} uniform convergence that applies to all \emph{near-empirically optimal} policies for MDPs with \emph{stationary} transitions. Here $d_m$ is the minimal marginal state-action probability. Critically, the highlight in achieving the optimal rate $\tilde{O}(H^2 / d_m\epsilon^2)$ is our design of the \emph{singleton absorbing MDP}, a new sharp analysis tool that works with the model-based approach. We generalize this model-based framework to two new settings, offline task-agnostic and offline reward-free learning, with optimal complexities $\tilde{O}(H^2\log(K) / d_m\epsilon^2)$ ($K$ is the number of tasks) and $\tilde{O}(H^2 S / d_m\epsilon^2)$, respectively. These results provide a unified solution for simultaneously solving different offline RL problems.
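As a reading aid, the two uniform-convergence criteria contrasted in the abstract can be written out explicitly (the $\Pi_{\text{all}}$ / $\Pi_{\text{loc}}$ shorthand below is ours, not the paper's notation):

\[
\text{global:}\quad \sup_{\pi\in\Pi_{\text{all}}}\big|Q^{\pi} - \hat{Q}^{\pi}\big| < \epsilon,
\qquad
\text{local:}\quad \sup_{\pi\in\Pi_{\text{loc}}}\big|Q^{\pi} - \hat{Q}^{\pi}\big| < \epsilon,
\]

where $\Pi_{\text{all}}$ is the class of all policies and $\Pi_{\text{loc}}$ is restricted to the near-empirically-optimal policies, i.e. policies that are close to optimal under the empirical model. Uniform convergence over either class immediately yields offline learning, since the empirically optimal policy lies in the class and its true value is then within $O(\epsilon)$ of optimal.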
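To make the "model-based" ingredient concrete, here is a minimal sketch of generic tabular model-based OPE for a time-homogeneous episodic MDP: build the empirical model from the offline dataset, then evaluate any target policy on it by backward induction to obtain $\hat{Q}^\pi$. The function names and the deterministic-policy interface are illustrative assumptions, not the paper's implementation, and the sketch omits the singleton-absorbing-MDP analysis device entirely.

    # Generic tabular model-based OPE sketch (assumed interface, not the paper's code).
    import numpy as np

    def empirical_mdp(data, S, A):
        """data: iterable of (s, a, r, s_next) tuples from the behavior policy."""
        counts = np.zeros((S, A, S))
        rewards = np.zeros((S, A))
        visits = np.zeros((S, A))
        for s, a, r, s_next in data:
            counts[s, a, s_next] += 1
            rewards[s, a] += r
            visits[s, a] += 1
        n = np.maximum(visits, 1)            # avoid division by zero for unvisited pairs
        P_hat = counts / n[:, :, None]       # empirical transition kernel
        r_hat = rewards / n                  # empirical mean rewards
        return P_hat, r_hat

    def evaluate_policy(P_hat, r_hat, pi, H):
        """pi[h, s] is the action of a deterministic policy; returns Q_hat of shape (H, S, A)."""
        S, A = r_hat.shape
        Q = np.zeros((H, S, A))
        V = np.zeros((H + 1, S))
        for h in reversed(range(H)):
            Q[h] = r_hat + P_hat @ V[h + 1]              # Bellman backup on the empirical model
            V[h] = Q[h][np.arange(S), pi[h]]             # value of the target policy's actions
        return Q

Local uniform OPE then asks that evaluate_policy be simultaneously $\epsilon$-accurate for every near-empirically-optimal policy, which is what the paper's $\tilde{O}(H^2/d_m\epsilon^2)$ bound guarantees in the stationary-transition setting.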
