Tail Distribution of Regret in Optimistic Reinforcement Learning
We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. Focusing on a UCBVI-type algorithm, we characterize the tail distribution of the cumulative regret over $K$ episodes, rather than only its expectation or a single high-probability quantile. We analyze two natural exploration-bonus schedules: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent scheme that depends only on the current episode index. For both settings, we obtain an upper bound on the tail probability $\mathbb{P}(\mathrm{Regret}(K) > x)$ that exhibits a distinctive two-regime structure: a sub-Gaussian tail that holds from an instance-dependent scale up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $\mathbb{E}[\mathrm{Regret}(K)]$. The proposed algorithm depends on a tuning parameter that balances the expected regret against the range over which the regret exhibits a sub-Gaussian tail. To the best of our knowledge, our results provide one of the first comprehensive tail-regret guarantees for a standard optimistic algorithm in episodic reinforcement learning.
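To make the two-regime structure concrete, a schematic form of such a bound is sketched below; the scale $\sigma$, lower scale $x_0$, transition threshold $x_{\mathrm{tr}}$, constants $c_1, c_2$, and sub-Weibull exponent $\theta$ are illustrative placeholders standing in for the paper's instance-dependent quantities, not the exact expressions derived there.

\[
\mathbb{P}\bigl(\mathrm{Regret}(K) > x\bigr) \;\le\;
\begin{cases}
\exp\!\left(-\dfrac{x^{2}}{c_{1}\,\sigma^{2}}\right), & x_{0} \le x \le x_{\mathrm{tr}} \quad \text{(sub-Gaussian regime)},\\[1.5ex]
\exp\!\left(-\bigl(x/c_{2}\bigr)^{\theta}\right), & x > x_{\mathrm{tr}} \quad \text{(sub-Weibull regime)}.
\end{cases}
\]

Under this reading, the tuning parameter trades off the two regimes: enlarging the sub-Gaussian range $[x_0, x_{\mathrm{tr}}]$ tightens the tail for moderate deviations at the cost of a larger bound on $\mathbb{E}[\mathrm{Regret}(K)]$.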