Posterior Sampling for Continuing Environments

8 pages (main), 2 figures, 3 pages (bibliography), 5 pages (appendix)
Abstract

We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time step, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a suitable schedule of $\gamma$, we establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, $T$ is the number of time steps, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy.
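
To make the resampling mechanism concrete, here is a minimal sketch in Python under illustrative assumptions not taken from the paper: a small tabular environment, a Dirichlet posterior over transition probabilities, a known reward function, and a fixed $\gamma$ (the paper's regret analysis relies on a schedule of $\gamma$). The helpers `sample_posterior` and `solve_discounted` are hypothetical stand-ins for the agent's posterior sampler and discounted-MDP planner, and `P_true` is a stand-in for the real environment.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.99
R = rng.random((S, A))          # assumed-known reward function (illustrative only)
counts = np.ones((S, A, S))     # Dirichlet(1) prior pseudo-counts over transitions

def sample_posterior(counts):
    """Draw one statistically plausible transition model from the posterior."""
    return np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                     for s in range(S)])

def solve_discounted(P, R, gamma, iters=500):
    """Value iteration for the gamma-discounted objective in the sampled model."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V   # (S, A, S) @ (S,) -> expected next-state values (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)     # greedy policy for the sampled model

# Stand-in for the true environment: a fixed draw from the prior.
P_true = sample_posterior(np.ones((S, A, S)))

policy = solve_discounted(sample_posterior(counts), R, gamma)
s = 0
for _ in range(10_000):
    a = policy[s]
    s_next = rng.choice(S, p=P_true[s, a])
    counts[s, a, s_next] += 1   # posterior update from the observed transition
    s = s_next
    # The mechanism from the abstract: with probability 1 - gamma, replace the
    # working model with a fresh posterior sample and re-plan against it.
    if rng.random() < 1 - gamma:
        policy = solve_discounted(sample_posterior(counts), R, gamma)
```

The key step is the final conditional: resampling with probability $1-\gamma$ means the agent commits to each sampled model for a geometrically distributed number of steps with mean $1/(1-\gamma)$, matching the effective horizon of the discounted objective it plans against.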
