Best Policy Identification in Linear MDPs

11 August 2022
Jerome Taupin
Yassir Jedra
Alexandre Proutière
Abstract

We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-\delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by ${\cal O}\left(\frac{d}{(\varepsilon+\Delta)^2}\left(\log\frac{1}{\delta}+d\right)\right)$, where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $\delta$), and matches existing minimax and gap-dependent lower bounds. We extend our algorithm to episodic linear MDPs.
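As a rough illustration (not part of the paper), the sample-complexity upper bound ${\cal O}\left(\frac{d}{(\varepsilon+\Delta)^2}\left(\log\frac{1}{\delta}+d\right)\right)$ can be evaluated numerically to see how it scales. All parameter values and the constant factor below are hypothetical:

```python
import math

def sample_complexity_bound(d, eps, delta, gap, c=1.0):
    """Evaluate the order of the upper bound
    O(d / (eps + gap)^2 * (log(1/delta) + d)),
    up to a hypothetical constant factor c.

    d     : dimension of the feature space
    eps   : target accuracy (epsilon)
    delta : failure probability
    gap   : minimum reward gap Delta of sub-optimal actions
    """
    return c * d / (eps + gap) ** 2 * (math.log(1.0 / delta) + d)

# Hypothetical instance: d = 10, eps = 0.1, delta = 0.05, gap = 0.2.
n = sample_complexity_bound(d=10, eps=0.1, delta=0.05, gap=0.2)
```

Note the behavior the bound predicts: shrinking $\delta$ only enters logarithmically, while a larger gap $\Delta$ reduces the required number of samples even when $\varepsilon$ is small.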
