Tree boosting for learning probability measures
Learning probability measures based on an i.i.d. sample is a fundamental inference task in statistics, but is difficult when the sample space is high-dimensional due to the so-called curse of dimensionality. Inspired by the success of tree boosting methods for overcoming high dimensionality in classification and regression, this paper proposes a new boosting method for learning probability distributions. To construct an additive ensemble of weak learners, we start by defining "a sum" of univariate probability measures in terms of nested compositions of cumulative distribution function (CDF) transforms, and then introduce a new notion of tree-based CDFs for multivariate sample spaces that generalizes such addition to multivariate distributions. This new rule gives rise to a simple boosting algorithm based on forward-stagewise (FS) fitting, which resembles the classical algorithm for boosting in supervised learning. The output of the FS algorithm allows analytic computation of the probability density function of the fitted distribution as well as exact simulation from the fitted measure. While the algorithm can be applied in conjunction with a variety of tree-based weak learners, we demonstrate its use in our numerical examples with a Pólya tree weak learner, and illustrate how the typical considerations in applying boosting -- namely choosing the number of trees, setting the appropriate level of shrinkage/regularization in the weak learner, and evaluating variable importance -- can all be accomplished in a fashion analogous to traditional boosting. Our numerical experiments confirm that boosting can substantially improve the fit to multivariate distributions compared to the state-of-the-art single-tree learner, and that it is computationally efficient. We also illustrate the method through an application to a 19-dimensional data set from flow cytometry.
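For intuition, here is a minimal univariate sketch of how nesting CDF transforms could act as a "sum" of measures, and of the two properties the abstract claims for the ensemble: an analytic density (via the chain rule) and an exact simulator (by inverting the stages in reverse order). The Beta stages, the function names, and the composition convention G = F_k ∘ ⋯ ∘ F_1 are illustrative assumptions for exposition, not the paper's exact construction, which uses tree-based CDFs on multivariate sample spaces.

```python
import numpy as np
from scipy import stats

# Hypothetical univariate "stages": each is a Beta distribution on [0, 1],
# so every stage CDF maps [0, 1] into [0, 1] and the composition is a CDF.
stages = [stats.beta(2, 5), stats.beta(5, 2), stats.beta(3, 3)]

def ensemble_cdf(x):
    """Nested composition G = F_k o ... o F_1 evaluated at x."""
    u = np.asarray(x, dtype=float)
    for d in stages:
        u = d.cdf(u)
    return u

def ensemble_pdf(x):
    """Analytic density by the chain rule: g(x) = prod_i f_i(u_{i-1}), u_0 = x."""
    u = np.asarray(x, dtype=float)
    g = np.ones_like(u)
    for d in stages:
        g *= d.pdf(u)   # multiply in the stage density at the current point
        u = d.cdf(u)    # then push the point through the stage CDF
    return g

def ensemble_sample(n, seed=0):
    """Exact simulation: apply the stage inverse CDFs in reverse order to a uniform draw."""
    u = np.random.default_rng(seed).uniform(size=n)
    for d in reversed(stages):
        u = d.ppf(u)    # inverse CDF (quantile function) of each stage
    return u

x = np.linspace(0.01, 0.99, 5)
print(ensemble_pdf(x))      # analytic density of the composed measure
print(ensemble_sample(3))   # exact draws from the composed measure
```

In this reading, "adding" a new weak learner amounts to composing one more CDF onto the ensemble, which is what makes a forward-stagewise fitting loop natural.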