23
50

Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning

Abstract

Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with SS states, AA actions and horizon length HH, substantial progress has been achieved towards characterizing the minimax-optimal regret, which scales on the order of H2SAT\sqrt{H^2SAT} (modulo log factors) with TT the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., S6A4poly(H)S^6A^4 \,\mathrm{poly}(H) for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity O(SAH)O(SAH), that achieves near-optimal regret as soon as the sample size exceeds the order of SApoly(H)SA\,\mathrm{poly}(H). In terms of this sample size requirement (also referred to the initial burn-in cost), our method improves -- by at least a factor of S5A3S^5A^3 -- upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called {\em reference-advantage decomposition}), the proposed algorithm employs an {\em early-settled} reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration-exploitation trade-offs.

View on arXiv
Comments on this paper