Tail Distribution of Regret in Optimistic Reinforcement Learning
We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. We first study a UCBVI-type (model-based) algorithm and characterize the tail distribution of the cumulative regret $\mathrm{Regret}(K)$ over $K$ episodes via explicit bounds on the tail probability $\mathbb{P}(\mathrm{Regret}(K) > x)$, going beyond analyses limited to the expected regret $\mathbb{E}[\mathrm{Regret}(K)]$ or a single high-probability quantile. We analyze two natural exploration-bonus schedules for UCBVI: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent (anytime) scheme that depends only on the current episode index. We then complement the model-based results with an analysis of optimistic Q-learning (model-free) under a $K$-dependent bonus schedule. Across both the model-based and model-free settings, we obtain upper bounds on $\mathbb{P}(\mathrm{Regret}(K) > x)$ with a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $\mathbb{E}[\mathrm{Regret}(K)]$. The proposed algorithms depend on a tuning parameter that trades off the expected regret against the range over which the regret exhibits sub-Gaussian decay.
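As a concrete point of reference for the two bonus schedules, a prototypical Hoeffding-style UCBVI bonus is sketched below; the constants, log factors, and symbols $H$ (horizon), $S$ and $A$ (numbers of states and actions), $N_k(s,a)$ (prior visit count of the state-action pair $(s,a)$), and $\delta$ (confidence parameter) are illustrative and are not taken from the paper. The only difference between the two schedules is whether the confidence level is set using the total number of episodes $K$ or the current episode index $k$:

\[
b_k(s,a) \;\propto\; H \sqrt{\frac{\log(SAHK/\delta)}{\max\{1,\, N_k(s,a)\}}}
\quad (K\text{-dependent}),
\qquad
b_k(s,a) \;\propto\; H \sqrt{\frac{\log(SAHk/\delta)}{\max\{1,\, N_k(s,a)\}}}
\quad (\text{anytime}).
\]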
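To make the two-regime structure concrete, a schematic version of such a tail bound reads as follows; the constants $c_1, c_2 > 0$, sub-Gaussian scale $\sigma$, lower scale $x_0$, transition threshold $x_*$, and sub-Weibull exponent $\theta$ are placeholders standing in for the paper's instance-dependent quantities:

\[
\mathbb{P}\bigl(\mathrm{Regret}(K) > x\bigr) \;\le\;
\begin{cases}
\exp\!\bigl(-c_1\, x^2 / \sigma^2\bigr), & x_0 \le x \le x_*,\\[4pt]
\exp\!\bigl(-c_2\, x^{\theta}\bigr), & x > x_*.
\end{cases}
\]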