We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed-confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability at least $1-\delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but it can be used as a starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by $\mathcal{O}\!\big(\tfrac{d}{(\varepsilon+\Delta)^2}\big(\log\tfrac{1}{\delta}+d\big)\big)$, where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $\delta$), and matches existing minimax and gap-dependent lower bounds. We extend our algorithm to episodic linear MDPs.
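For readers less familiar with fixed-confidence identification, the sketch below shows the generic change-of-measure form that instance-specific lower bounds of this kind usually take; the notation ($T^*$, the sampling proportions $\omega$, the alternative set $\mathrm{Alt}_\varepsilon$) is illustrative and is not taken from the paper, which states its own, more specific program.

\[
  \mathbb{E}_{\mathcal{M}}[\tau_\delta] \;\ge\; T^*(\mathcal{M})\,\mathrm{kl}(\delta,\,1-\delta),
  \qquad
  \frac{1}{T^*(\mathcal{M})} \;=\; \sup_{\omega \in \Sigma}\; \inf_{\mathcal{M}' \in \mathrm{Alt}_\varepsilon(\mathcal{M})}\; \sum_{(s,a)} \omega_{s,a}\,\mathrm{KL}\!\big(\mathcal{M}_{s,a}\,\big\|\,\mathcal{M}'_{s,a}\big),
\]

where $\Sigma$ is the simplex of sampling proportions over state-action pairs, $\mathrm{Alt}_\varepsilon(\mathcal{M})$ is the set of MDPs whose $\varepsilon$-optimal policies differ from those of $\mathcal{M}$, and $\mathrm{kl}$ is the binary relative entropy. The inner infimum over alternative instances is what typically makes the resulting optimization program non-convex, and the maximizing $\omega$ plays the role of the optimal sampling rule.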