Speeding Up MCMC by Efficient Data Subsampling
We propose a Markov chain Monte Carlo (MCMC) framework in which the likelihood function is estimated from a random subset of the observations. Inspired by the survey sampling literature, we introduce a general and highly efficient log-likelihood estimator that incorporates information about each observation's contribution to the log-likelihood. The computational cost of the estimator can be much smaller than that of the full log-likelihood, and we document substantial speed-ups in our applications. The likelihood estimate is used within a pseudo-marginal framework to sample from a perturbed posterior, which we prove lies within a bounded distance of the true posterior; moreover, the approximation error is demonstrated to be negligible even for small subsample sizes in our applications. We propose a simple way to choose the subsample size adaptively during the MCMC so as to optimize sampling efficiency for a fixed computational budget. We also propose a correlated pseudo-marginal approach to subsampling that dramatically improves performance. The method is illustrated on three examples, each representing a different data structure. In particular, we show that our method outperforms other subsampling MCMC methods proposed in the literature.
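The core ideas (a subsampled log-likelihood estimator built around per-observation control variates, plugged into a pseudo-marginal Metropolis-Hastings step) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the Gaussian toy model, the choice of control variate `q_i = ell_i(theta_bar)`, the flat prior, and all function names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i ~ N(theta, 1), with log-likelihood contribution
# ell_i(theta) = log N(y_i; theta, 1). Purely illustrative.
n = 10_000
y = rng.normal(1.5, 1.0, size=n)

def ell(theta, idx):
    """Log-likelihood contributions of the observations in idx."""
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (y[idx] - theta) ** 2

# Control variates: cheap proxies q_i = ell_i(theta_bar) at a fixed
# central point; their total is computed once, outside the MCMC loop.
theta_bar = y.mean()
q = ell(theta_bar, np.arange(n))
q_total = q.sum()

def ell_hat(theta, m):
    """Unbiased difference-style estimator of the full log-likelihood:
    the proxy sum is added exactly, and only the residual sum is
    estimated from a size-m subsample, so each evaluation costs O(m)
    rather than O(n)."""
    idx = rng.integers(0, n, size=m)       # subsample with replacement
    resid = ell(theta, idx) - q[idx]       # residuals on the subsample
    return q_total + n * resid.mean()

def pm_mh(n_iter=2000, m=100, step=0.05):
    """Pseudo-marginal Metropolis-Hastings with a flat prior: the noisy
    log-likelihood estimate at the current state is stored and reused
    (never recomputed), which is what keeps the scheme a valid
    pseudo-marginal chain."""
    theta, llhat = theta_bar, ell_hat(theta_bar, m)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        llprop = ell_hat(prop, m)
        if np.log(rng.uniform()) < llprop - llhat:
            theta, llhat = prop, llprop
        draws[t] = theta
    return draws
```

Because the estimator is unbiased on the log scale (not the likelihood scale), the chain targets a perturbed posterior whose error shrinks as the estimator's variance is controlled, which is exactly the role of the control variates in the approach described above.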