arXiv:2305.19562

Replicability in Reinforcement Learning

31 May 2023
Amin Karbasi
Grigoris Velegkas
Lin F. Yang
Felix Y. Zhou
Abstract

We initiate the mathematical study of replicability as an algorithmic property in the context of reinforcement learning (RL). We focus on the fundamental setting of discounted tabular MDPs with access to a generative model. Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is replicable if, with high probability, it outputs the exact same policy after two executions on i.i.d. samples drawn from the generator when its internal randomness is the same. We first provide an efficient $\rho$-replicable algorithm for $(\varepsilon, \delta)$-optimal policy estimation with sample and time complexity $\widetilde O\left(\frac{N^3\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$, where $N$ is the number of state-action pairs. Next, for the subclass of deterministic algorithms, we provide a lower bound of order $\Omega\left(\frac{N^3}{(1-\gamma)^3\cdot\varepsilon^2\cdot\rho^2}\right)$. Then, we study a relaxed version of replicability proposed by Kalavasis et al. [2023] called TV indistinguishability. We design a computationally efficient TV indistinguishable algorithm for policy estimation whose sample complexity is $\widetilde O\left(\frac{N^2\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$. At the cost of $\exp(N)$ running time, we transform these TV indistinguishable algorithms to $\rho$-replicable ones without increasing their sample complexity. Finally, we introduce the notion of approximate-replicability where we only require that two outputted policies are close under an appropriate statistical divergence (e.g., Rényi) and show an improved sample complexity of $\widetilde O\left(\frac{N\cdot\log(1/\delta)}{(1-\gamma)^5\cdot\varepsilon^2\cdot\rho^2}\right)$.
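To make the replicability requirement concrete, here is a minimal Python sketch of the idea on a much simpler problem than policy estimation: estimating a single mean. It is not the algorithm from the paper. It uses randomized rounding to a grid whose offset comes from a shared seed, a standard device in the replicability literature. The function names, the Bernoulli data source, and the parameters `grid_width` and `shared_seed` are all assumptions made for illustration.

```python
import random


def replicable_mean(samples, grid_width, shared_seed):
    """Round the empirical mean to a randomly shifted grid.

    The grid offset is drawn from shared_seed, which plays the role of the
    internal randomness shared across the two executions (an assumption of
    this sketch).
    """
    rng = random.Random(shared_seed)
    offset = rng.uniform(0.0, grid_width)
    mean = sum(samples) / len(samples)
    # Snap to the nearest grid point of the form offset + k * grid_width.
    k = round((mean - offset) / grid_width)
    return offset + k * grid_width


def draw_samples(n, p, rng):
    """Hypothetical data source: n i.i.d. Bernoulli(p) draws."""
    return [1.0 if rng.random() < p else 0.0 for _ in range(n)]


def agreement_rate(trials=200, n=5000, p=0.3, grid_width=0.1):
    """Fraction of trials in which two runs on fresh samples, sharing the
    same seed, return exactly the same output."""
    data_rng = random.Random(0)
    agree = 0
    for seed in range(trials):
        a = replicable_mean(draw_samples(n, p, data_rng), grid_width, seed)
        b = replicable_mean(draw_samples(n, p, data_rng), grid_width, seed)
        agree += (a == b)
    return agree / trials


if __name__ == "__main__":
    print("empirical agreement rate:", agreement_rate())
```

The two runs see different samples, so their empirical means differ slightly, yet they return the identical value unless a grid boundary happens to fall between the two means. A coarser grid raises the agreement rate and lowers accuracy, which gives some intuition for why the sample complexities in the abstract depend on both $\varepsilon$ and $\rho$.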
