
Learning Zero-sum Stochastic Games with Posterior Sampling

International Conference on Artificial Intelligence and Statistics (AISTATS), 2021
Abstract

In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves a Bayesian regret bound of $O(HS\sqrt{AT})$ in infinite-horizon zero-sum stochastic games with the average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions, and $T$ is the horizon. We consider the online setting, where the opponent cannot be controlled and may follow an arbitrary time-adaptive, history-dependent strategy. This bound improves on the best existing regret bound of $O(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $A$ and $T$.
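
To make the improvement in the dependence on $T$ concrete, the two rates can be compared directly. This is only an illustrative sketch: it holds $S$, $A$, $H$ and $D$ fixed and ignores constants, which is a simplification introduced here and not a statement taken from the paper.

```latex
\[
\frac{\sqrt{T}}{T^{2/3}} = T^{-1/6} \longrightarrow 0
\quad \text{as } T \to \infty,
\qquad \text{e.g. } T = 10^{6}: \quad \sqrt{T} = 10^{3}, \quad T^{2/3} = 10^{4}.
\]
```

Under these assumptions, the $\sqrt{T}$ rate is smaller than the $T^{2/3}$ rate by a factor of $T^{1/6}$, which is a factor of 10 at $T = 10^{6}$.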
