
Rate-Optimal Policy Optimization for Linear Markov Decision Processes

International Conference on Machine Learning (ICML), 2023
Abstract

We study regret minimization in online episodic linear Markov Decision Processes, and obtain rate-optimal $\widetilde O(\sqrt K)$ regret, where $K$ denotes the number of episodes. Our work is the first to establish the optimal (w.r.t. $K$) rate of convergence in the stochastic setting with bandit feedback using a policy optimization based approach, and the first to establish the optimal (w.r.t. $K$) rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee is currently known.
