
Diffusion Approximations for Thompson Sampling

41 pages (main), 3 pages (bibliography), 1 page (appendix); 1 table
Abstract

We study the behavior of Thompson sampling from the perspective of weak convergence. In the regime where the gaps between arm means scale as $1/\sqrt{n}$ with the time horizon $n$, we show that the dynamics of Thompson sampling evolve according to discrete versions of SDEs and random ODEs. As $n \to \infty$, we show that the dynamics converge weakly to solutions of the corresponding SDEs and random ODEs. (Recently, Wager and Xu (arXiv:2101.09855) independently proposed this regime and developed similar SDE and random ODE approximations.) Our weak convergence theory covers both the classical multi-armed and linear bandit settings, and can be used, for instance, to obtain insight about the characteristics of the regret distribution when there is information sharing among arms, as well as the effects of variance estimation, model mis-specification and batched updates in bandit learning. Our theory is developed from first principles and can also be adapted to analyze other sampling-based bandit algorithms.
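To make the regime concrete, the following is a minimal sketch (not the paper's code) of Gaussian Thompson sampling on a two-armed bandit whose mean gap scales as $c/\sqrt{n}$, the scaling studied in the abstract. The function name, the constant `c`, and the flat-prior posterior update are illustrative assumptions; rewards are taken to be $N(\mu_k, 1)$ with known unit variance, so the posterior for each arm is Gaussian with variance shrinking as one over its pull count.

```python
import math
import random

def thompson_two_arms(n, c=1.0, seed=0):
    """Gaussian Thompson sampling over horizon n with gap c/sqrt(n).

    Arm 0 has mean c/sqrt(n), arm 1 has mean 0; rewards are N(mu_k, 1).
    With a flat prior, the posterior for arm k after count[k] pulls is
    N(sum[k]/count[k], 1/count[k]). Returns the pull counts per arm.
    (Illustrative sketch of the diffusion regime, not the paper's code.)
    """
    rng = random.Random(seed)
    mus = [c / math.sqrt(n), 0.0]   # gap shrinks as 1/sqrt(n)
    sums = [0.0, 0.0]
    counts = [0, 0]
    for _ in range(n):
        samples = []
        for k in range(2):
            if counts[k] == 0:
                # unpulled arm: sample from a wide reference prior
                samples.append(rng.gauss(0.0, 1.0))
            else:
                samples.append(rng.gauss(sums[k] / counts[k],
                                         1.0 / math.sqrt(counts[k])))
        k = 0 if samples[0] >= samples[1] else 1  # play the arm with the larger posterior draw
        reward = rng.gauss(mus[k], 1.0)
        sums[k] += reward
        counts[k] += 1
    return counts
```

In this regime, unlike the fixed-gap setting, both arms are typically pulled a constant fraction of the horizon, which is why the pull counts and regret admit nondegenerate SDE/ODE limits.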
