Posterior Sampling for Continuing Environments
We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach maintains a statistically plausible model of the environment and follows a policy that maximizes expected γ-discounted return in that model. At each time, with probability 1−γ, the model is replaced by a sample from the posterior distribution over environments. For a suitable schedule of γ, we establish an Õ(τS√(AT)) bound on the Bayesian regret, where S is the number of environment states, A is the number of actions, and τ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy.
View on arXiv
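The resampling scheme described in the abstract can be sketched in a toy tabular setting. Everything below is an illustrative assumption rather than the paper's implementation: the Dirichlet posterior over transitions, known rewards, the small state/action counts, and all variable names are hypothetical choices made only to show the mechanics of resampling with probability 1−γ.

```python
import numpy as np

rng = np.random.default_rng(0)

S, A = 3, 2      # toy state/action counts (hypothetical)
gamma = 0.9      # discount factor; resampling probability is 1 - gamma

# True (unknown) environment: random transitions, fixed known rewards.
P_true = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))  # rewards assumed known for simplicity

# Dirichlet posterior over transition probabilities (conjugate prior).
alpha = np.ones((S, A, S))

def sample_model(alpha):
    """Draw a statistically plausible environment from the posterior."""
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(alpha[s, a])
    return P

def greedy_policy(P, R, gamma, iters=200):
    """Policy maximizing expected gamma-discounted return in model P."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V  # Q[s, a]; P @ V contracts next-state axis
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Continuing interaction: no episode resets.
P_hat = sample_model(alpha)
pi = greedy_policy(P_hat, R, gamma)
s = 0
for t in range(1000):
    if rng.random() < 1 - gamma:  # resample with probability 1 - gamma
        P_hat = sample_model(alpha)
        pi = greedy_policy(P_hat, R, gamma)
    a = pi[s]
    s_next = rng.choice(S, p=P_true[s, a])
    alpha[s, a, s_next] += 1      # posterior update from observed transition
    s = s_next
```

Because resampling is triggered by independent coin flips rather than episode boundaries, the agent commits to each sampled model for a geometrically distributed duration, which is what makes the scheme compatible with a continuing interface.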