Random forests have become an important tool for improving accuracy in regression problems since their popularization by [Breiman, 2001] and others. In this paper, we revisit a random forest model originally proposed by [Breiman, 2004] and later studied by [Biau, 2012], in which a feature is selected at random and the split occurs at the midpoint of the current cell along the chosen feature. If the Lipschitz regression function is sparse, depending only on a small, unknown subset of $S$ out of $d$ features, we show that, given access to $n$ observations, this random forest model outputs a predictor whose mean-squared prediction error is of order $O\big(\big(n(\log n)^{(S-1)/2}\big)^{-\frac{1}{S\log 2+1}}\big)$. This positively answers an outstanding question of [Biau, 2012] about whether the rate of convergence therein could be improved. The second part of this article shows that the aforementioned prediction error cannot generally be improved, which we accomplish by characterizing the variance and by showing that the bias is tight for any linear model with a nonzero parameter vector. As a striking consequence of our analysis, we show that the variance of this forest is similar in form to the best-case variance lower bound of [Lin and Jeon, 2006] among all random forest models with nonadaptive splitting schemes (i.e., where the split protocol is independent of the training data).
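To make the splitting scheme concrete, the following is a minimal sketch (not from the paper) of the centered random forest described above: each node selects a feature uniformly at random and splits its cell at the midpoint along that feature, independently of the training data. The function names, the fixed depth `k`, the forest size, and the toy sparse target below are illustrative assumptions, not the paper's tuning.

```python
# Sketch of a centered (nonadaptive) random forest, assuming uniform
# feature selection and a fixed tree depth; both are illustrative choices.
import random

def build_centered_tree(lower, upper, depth, rng):
    """Recursively partition the box [lower, upper] ⊂ [0,1]^d.

    An internal node is (feature, midpoint, left, right); None marks a
    leaf, where prediction falls back to averaging.
    """
    if depth == 0:
        return None
    j = rng.randrange(len(lower))             # feature chosen at random
    mid = 0.5 * (lower[j] + upper[j])         # split at the cell midpoint
    left_upper = upper.copy();  left_upper[j] = mid
    right_lower = lower.copy(); right_lower[j] = mid
    return (j, mid,
            build_centered_tree(lower, left_upper, depth - 1, rng),
            build_centered_tree(right_lower, upper, depth - 1, rng))

def leaf_id(tree, x):
    """Trace x down the tree; the left/right path identifies its leaf."""
    path = []
    while tree is not None:
        j, mid, left, right = tree
        go_left = x[j] <= mid
        path.append(go_left)
        tree = left if go_left else right
    return tuple(path)

def tree_predict(tree, X, y, x):
    """Average the responses of training points sharing x's leaf."""
    target = leaf_id(tree, x)
    vals = [yi for xi, yi in zip(X, y) if leaf_id(tree, xi) == target]
    return sum(vals) / len(vals) if vals else sum(y) / len(y)

def forest_predict(trees, X, y, x):
    """The forest prediction is the average over the individual trees."""
    return sum(tree_predict(t, X, y, x) for t in trees) / len(trees)

if __name__ == "__main__":
    rng = random.Random(0)
    d, n, k, num_trees = 5, 500, 4, 50
    # Sparse target: depends only on 2 of the d features (S = 2 here).
    X = [[rng.random() for _ in range(d)] for _ in range(n)]
    y = [xi[0] + xi[1] for xi in X]
    trees = [build_centered_tree([0.0] * d, [1.0] * d, k, rng)
             for _ in range(num_trees)]
    print(forest_predict(trees, X, y, [0.5] * d))  # roughly 1.0
```

Note that the partition is built without looking at the data, which is exactly the nonadaptive property under which the [Lin and Jeon, 2006] variance lower bound applies.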