
No-Regret Thompson Sampling for Finite-Horizon Markov Decision Processes with Gaussian Processes

Main: 9 pages
Figures: 4
Bibliography: 4 pages
Appendix: 16 pages
Abstract

Thompson sampling (TS) is a powerful and widely used strategy for sequential decision-making, with applications ranging from Bayesian optimization to reinforcement learning (RL). Despite its success, the theoretical foundations of TS remain limited, particularly in settings with complex temporal structure such as RL. We address this gap by establishing no-regret guarantees for TS using models with Gaussian marginal distributions. Specifically, we consider TS in episodic RL with joint Gaussian process (GP) priors over rewards and transitions. We prove a regret bound of $\tilde{\mathcal{O}}(\sqrt{KH\,\Gamma(KH)})$ over $K$ episodes of horizon $H$, where $\Gamma(\cdot)$ captures the complexity of the GP model. Our analysis addresses several challenges, including the non-Gaussian nature of value functions and the recursive structure of Bellman updates, and extends classical tools such as the elliptical potential lemma to multi-output settings. This work advances the understanding of TS in RL and highlights how structural assumptions and model uncertainty shape its performance in finite-horizon Markov Decision Processes.
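To make the episodic TS loop concrete, below is a minimal, hypothetical sketch (not the paper's algorithm or analysis). It uses a GP posterior over rewards on a small discrete state-action grid and, for simplicity, a Dirichlet posterior over transitions, whereas the paper places a joint GP prior over rewards and transitions; all sizes and names (`S`, `A`, `H`, `K`, `sigma_n`, `gp_posterior_sample`, `plan`) are illustrative assumptions.

```python
import numpy as np

# Illustrative Thompson sampling for a finite-horizon MDP (assumed setup):
# each episode, sample a reward function from a GP posterior and a
# transition model from a Dirichlet posterior, plan by backward induction
# over horizon H, act, and update the posteriors with observed data.

rng = np.random.default_rng(0)
S, A, H, K = 5, 3, 4, 50          # states, actions, horizon, episodes
sigma_n = 0.1                      # reward observation noise std

# RBF kernel over (state, action) index pairs.
X_all = np.array([[s, a] for s in range(S) for a in range(A)], dtype=float)
def rbf(X1, X2, ls=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

true_r = rng.uniform(0, 1, size=(S, A))          # hidden mean rewards
true_P = rng.dirichlet(np.ones(S), size=(S, A))  # hidden transitions

obs_X, obs_y = [], []                  # reward observations so far
trans_counts = np.ones((S, A, S))      # Dirichlet pseudo-counts

def gp_posterior_sample():
    """Draw one reward function from the GP posterior on the grid."""
    if not obs_X:
        mu, K_post = np.zeros(S * A), rbf(X_all, X_all)
    else:
        Xo, yo = np.array(obs_X, float), np.array(obs_y)
        K_oo = rbf(Xo, Xo) + sigma_n**2 * np.eye(len(yo))
        K_so = rbf(X_all, Xo)
        mu = K_so @ np.linalg.solve(K_oo, yo)
        K_post = rbf(X_all, X_all) - K_so @ np.linalg.solve(K_oo, K_so.T)
    sample = rng.multivariate_normal(mu, K_post + 1e-8 * np.eye(S * A))
    return sample.reshape(S, A)

def plan(r_hat, P_hat):
    """Finite-horizon backward induction for the sampled MDP."""
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r_hat + P_hat @ V           # (S, A) action values at step h
        pi[h] = Q.argmax(1)
        V = Q.max(1)
    return pi

for k in range(K):
    r_hat = gp_posterior_sample()                       # sampled rewards
    P_hat = np.array([[rng.dirichlet(trans_counts[s, a])
                       for a in range(A)] for s in range(S)])
    pi = plan(r_hat, P_hat)                             # greedy w.r.t. the sample
    s = rng.integers(S)
    for h in range(H):                                  # roll out one episode
        a = pi[h, s]
        r = true_r[s, a] + sigma_n * rng.standard_normal()
        s_next = rng.choice(S, p=true_P[s, a])
        obs_X.append([s, a]); obs_y.append(r)           # update posteriors
        trans_counts[s, a, s_next] += 1
        s = s_next
```

The key TS step is that planning uses a single posterior sample per episode rather than an optimistic bonus; the regret analysis in the paper quantifies how the GP model's complexity term $\Gamma(KH)$ governs how quickly such sampled models concentrate.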
