Speeding Up MCMC by Efficient Data Subsampling
We propose a Markov chain Monte Carlo (MCMC) framework in which the likelihood function is estimated from a random subset of the observations. Inspired by the survey sampling literature, we introduce a general and highly efficient log-likelihood estimator that incorporates information about each observation's contribution to the log-likelihood. The computational cost of the estimator can be much smaller than that of the full log-likelihood, and we document substantial speed-ups in our applications. The likelihood estimate is used within a pseudo-marginal framework to sample from a perturbed posterior, which we prove lies within a bounded distance of the true posterior; moreover, the approximation error is demonstrated to be negligible even for small subsample sizes in our applications. We propose a simple way to choose the subsample size adaptively during the MCMC so as to optimize sampling efficiency for a fixed computational budget. We also propose a correlated pseudo-marginal approach to subsampling that dramatically improves performance. The method is illustrated on three examples, each representing a different data structure. In particular, we show that our method outperforms other subsampling MCMC methods proposed in the literature.
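The core ideas (a subsampled log-likelihood estimator built around per-observation control variates, plugged into a pseudo-marginal Metropolis-Hastings step) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the Gaussian toy model, the choice of control variate `q_i = ell_i(theta_bar)`, the flat prior, and all function names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i ~ N(theta, 1), with log-likelihood contribution
# ell_i(theta) = log N(y_i; theta, 1). Purely illustrative.
n = 10_000
y = rng.normal(1.5, 1.0, size=n)

def ell(theta, idx):
    """Log-likelihood contributions of the observations in idx."""
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (y[idx] - theta) ** 2

# Control variates: cheap proxies q_i = ell_i(theta_bar) at a fixed
# central point; their total is computed once, outside the MCMC loop.
theta_bar = y.mean()
q = ell(theta_bar, np.arange(n))
q_total = q.sum()

def ell_hat(theta, m):
    """Unbiased difference-style estimator of the full log-likelihood:
    the proxy sum is added exactly, and only the residual sum is
    estimated from a size-m subsample, so each evaluation costs O(m)
    rather than O(n)."""
    idx = rng.integers(0, n, size=m)       # subsample with replacement
    resid = ell(theta, idx) - q[idx]       # residuals on the subsample
    return q_total + n * resid.mean()

def pm_mh(n_iter=2000, m=100, step=0.05):
    """Pseudo-marginal Metropolis-Hastings with a flat prior: the noisy
    log-likelihood estimate at the current state is stored and reused
    (never recomputed), which is what keeps the scheme a valid
    pseudo-marginal chain."""
    theta, llhat = theta_bar, ell_hat(theta_bar, m)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        llprop = ell_hat(prop, m)
        if np.log(rng.uniform()) < llprop - llhat:
            theta, llhat = prop, llprop
        draws[t] = theta
    return draws
```

Because the estimator is unbiased on the log scale (not the likelihood scale), the chain targets a perturbed posterior whose error shrinks as the estimator's variance is controlled, which is exactly the role of the control variates in the approach described above.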