Upper and Lower Bounds for Distributionally Robust Off-Dynamics Reinforcement Learning

30 September 2024
Zhishuai Liu
Weixin Wang
Pan Xu
Abstract

We study off-dynamics Reinforcement Learning (RL), where the policy training and deployment environments are different. To deal with this environmental perturbation, we focus on learning policies robust to uncertainties in transition dynamics under the framework of distributionally robust Markov decision processes (DRMDPs), where the nominal and perturbed dynamics are linear Markov Decision Processes. We propose a novel algorithm We-DRIVE-U that enjoys an average suboptimality of $\widetilde{\mathcal{O}}\big(dH \cdot \min\{1/\rho, H\}/\sqrt{K}\big)$, where $K$ is the number of episodes, $H$ is the horizon length, $d$ is the feature dimension, and $\rho$ is the uncertainty level. This result improves the state of the art by $\mathcal{O}(dH/\min\{1/\rho, H\})$. We also construct a novel hard instance and derive the first information-theoretic lower bound in this setting, which indicates our algorithm is near-optimal up to $\mathcal{O}(\sqrt{H})$ for any uncertainty level $\rho \in (0, 1]$. Our algorithm also enjoys a 'rare-switching' design, and thus only requires $\mathcal{O}(dH\log(1+H^2K))$ policy switches and $\mathcal{O}(d^2H\log(1+H^2K))$ calls to an oracle that solves dual optimization problems, which significantly improves the computational efficiency of existing algorithms for DRMDPs, whose policy switch and oracle complexities are both $\mathcal{O}(K)$.
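The following is a minimal sketch, not taken from the paper, that makes the stated rates concrete: it plugs hypothetical values of $d$, $H$, $K$, and $\rho$ into the bounds quoted in the abstract, ignoring constants and logarithmic factors, to compare We-DRIVE-U's suboptimality and switching/oracle complexities against the $\mathcal{O}(K)$ baselines.

```python
import math

# Hypothetical problem sizes; constants and log factors are omitted,
# so all quantities below are order-of-magnitude illustrations only.
d, H, K, rho = 10, 20, 100_000, 0.1

eff_horizon = min(1.0 / rho, H)  # the min{1/rho, H} term in the bound

# Average suboptimality of We-DRIVE-U: O~(d H * min{1/rho, H} / sqrt(K))
we_drive_u_bound = d * H * eff_horizon / math.sqrt(K)

# Prior state of the art is larger by a factor of O(d H / min{1/rho, H})
prior_bound = we_drive_u_bound * (d * H / eff_horizon)

# Rare-switching design: policy switches and dual-oracle calls grow only
# logarithmically in K, versus O(K) for existing DRMDP algorithms.
policy_switches = d * H * math.log(1 + H**2 * K)
oracle_calls = d**2 * H * math.log(1 + H**2 * K)

print(f"We-DRIVE-U suboptimality (up to constants): {we_drive_u_bound:.2f}")
print(f"Prior bound (up to constants):              {prior_bound:.2f}")
print(f"Policy switches ~ {policy_switches:.0f}  vs  O(K) = {K}")
print(f"Oracle calls    ~ {oracle_calls:.0f}  vs  O(K) = {K}")
```

Under these hypothetical settings the effective horizon is $\min\{1/\rho, H\} = 10$, so the prior bound is a factor of $dH/\min\{1/\rho, H\} = 20$ larger, and the logarithmic switching cost stays in the hundreds while $K = 10^5$.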
