Settling the Sample Complexity of Online Reinforcement Learning

A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a ``large-sample'' regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by \cite{zhang2020reinforcement}, achieves a regret on the order of (modulo log factors)
\begin{equation*}
\min\big\{ \sqrt{SAH^3K}, \, HK \big\},
\end{equation*}
where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size $K$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factors, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm to decouple complicated statistical dependency -- a long-standing challenge facing the analysis of online RL in the sample-hungry regime.
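As a heuristic illustration of how the stated regret bound translates into the quoted PAC sample complexity, one may invoke the standard regret-to-PAC (online-to-batch) conversion; the sketch below suppresses constants and log factors and is not the paper's own derivation:
\begin{equation*}
\frac{1}{K}\,\mathrm{Regret}(K) \;\lesssim\; \frac{\sqrt{SAH^3K}}{K} \;=\; \sqrt{\frac{SAH^3}{K}} \;\le\; \varepsilon
\quad\Longleftrightarrow\quad
K \;\gtrsim\; \frac{SAH^3}{\varepsilon^2}.
\end{equation*}
Since the bound holds for every $K \ge 1$, this conversion yields $\varepsilon$-accuracy for the full range of target accuracies, mirroring the absence of any burn-in requirement on the regret side.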
@article{zhang2025_2307.13586,
  title   = {Settling the Sample Complexity of Online Reinforcement Learning},
  author  = {Zihan Zhang and Yuxin Chen and Jason D. Lee and Simon S. Du},
  journal = {arXiv preprint arXiv:2307.13586},
  year    = {2025}
}