Finding good policies in average-reward Markov Decision Processes without prior knowledge

27 May 2024 · arXiv:2405.17108
Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann
Abstract

We revisit the identification of an ε-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter D and the optimal bias span H, which satisfy H ≤ D. Prior work has studied the complexity of ε-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with D ≃ H for which the sample complexity of outputting an ε-optimal policy is Ω(SAD/ε²), where S and A are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order SAH/ε² was proposed, but it requires knowledge of H. We first show that the sample complexity required to estimate H is not bounded by any function of S, A and H, ruling out any easy way to make the previous algorithm agnostic to H. By relying instead on a diameter estimation procedure, we propose the first algorithm for (ε,δ)-PAC policy identification that does not need any form of prior knowledge about the MDP. Its sample complexity scales as SAD/ε² in the regime of small ε, which is near-optimal. In the online setting, our first contribution is a lower bound implying that a sample complexity polynomial in H cannot be achieved. We then propose an online algorithm with a sample complexity of SAD²/ε², as well as a novel approach based on a data-dependent stopping rule that we believe is promising for further reducing this bound.
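For context, here are the standard formal definitions of the two complexity measures, in the usual average-reward notation (a minimal sketch; the paper's exact conventions may differ). The diameter is a worst-case travel time between states, while H is the span of the optimal bias function h*:

```latex
% Diameter: worst-case minimal expected hitting time, where \tau_{s'}
% is the first time the process reaches s' and the min is over policies.
D = \max_{s \neq s'} \, \min_{\pi} \, \mathbb{E}_{\pi}\!\left[ \tau_{s'} \mid s_0 = s \right]

% Optimal bias span: the range of the optimal bias function h^*.
H = \operatorname{sp}(h^*) = \max_{s} h^*(s) - \min_{s} h^*(s)
```

To make the diameter concrete, the sketch below computes D exactly for a small MDP with known transitions by solving, for each goal state, the fixed point for minimal expected hitting times. This only illustrates the quantity itself, not the paper's estimation procedure (which must work from samples); the tensor layout and the toy two-state MDP are hypothetical choices of ours.

```python
import numpy as np

def diameter(P, tol=1e-10, max_iter=100_000):
    """Diameter of a communicating MDP with known transition tensor.

    P[s, a, s2] is the probability of reaching s2 by playing a in s.
    For each goal g, value iteration solves
        T(s) = 1 + min_a sum_{s2} P[s, a, s2] * T(s2),  T(g) = 0,
    the minimal expected hitting time of g; D is the worst case over
    start/goal pairs. T grows without bound if some state cannot
    reach the goal, so this assumes a communicating MDP.
    """
    S, A, _ = P.shape
    D = 0.0
    for g in range(S):
        T = np.zeros(S)
        for _ in range(max_iter):
            T_new = 1.0 + np.min(P @ T, axis=1)  # (S, A) minimized over a
            T_new[g] = 0.0                       # goal state is absorbing
            if np.max(np.abs(T_new - T)) < tol:
                T = T_new
                break
            T = T_new
        D = max(D, T.max())
    return D

# Hypothetical toy MDP: action 0 stays put, action 1 moves to the
# other state with probability 0.5, so the diameter is 1/0.5 = 2.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)               # action 0: stay
P[:, 1, :] = np.full((2, 2), 0.5)    # action 1: uniform coin flip
print(diameter(P))  # ≈ 2.0
```

Note that this exact computation needs the full transition model; estimating D from interaction samples, as the paper's agnostic algorithm must, is the harder problem its diameter estimation procedure addresses.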
