Complete Analysis of a Random Forest Model

Random forests have become an important tool for improving accuracy in regression problems since their popularization by [Breiman, 2001] and others. In this paper, we revisit a random forest model originally proposed by [Breiman, 2004] and later studied by [Biau, 2012], in which a feature is selected at random and the split occurs at the midpoint of the box containing the chosen feature. If the Lipschitz regression function is sparse and depends only on a small, unknown subset of $S$ out of $d$ features, we show that, given access to $n$ observations, this random forest model outputs a predictor with mean-squared prediction error of order $\left(n\sqrt{\log^{S-1} n}\right)^{-1/(S\log 2+1)}$. When $S \leq \lfloor 0.72\,d \rfloor$, this rate is significantly better than the minimax optimal rate $n^{-2/(d+2)}$ for Lipschitz function classes in $d$ dimensions. The second part of this article shows that the prediction error for this random forest model cannot generally be improved. As a striking consequence of our analysis, we show that if $\ell_{\mathrm{avg}}$ (resp. $\ell_{\max}$) is the average (resp. maximum) number of observations per leaf node, then the variance of this forest is $\Theta\!\left(\ell_{\mathrm{avg}}^{-1}(\log n)^{-(S-1)}\right)$. When $\ell_{\max} \asymp \ell_{\mathrm{avg}}$, this variance is similar in form to the best-case variance lower bound of [Lin and Jeon, 2006] for any random forest model with a nonadaptive splitting scheme (i.e., where the split protocol is independent of the data). We also show that the bias is tight for any linear model with a nonzero parameter vector. Finally, a side consequence of our analysis is that if the regression function is merely square-integrable (in particular, it need not be continuous or bounded), then the random forest predictor is pointwise consistent almost everywhere.
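Given the rate stated above, a short exponent comparison shows where the $0.72$ threshold comes from: the constant is $1/(2\log 2) \approx 0.7213$. The check below ignores the logarithmic factor in the forest's rate.

```latex
% The forest's rate beats the minimax rate n^{-2/(d+2)} exactly when
% its exponent is the larger of the two (log factors ignored):
\[
  \frac{1}{S\log 2 + 1} > \frac{2}{d+2}
  \;\Longleftrightarrow\; d > 2S\log 2
  \;\Longleftrightarrow\; S < \frac{d}{2\log 2} \approx 0.7213\, d .
\]
```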
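To make the splitting scheme concrete, here is a minimal Python sketch of the model, not code from the paper: each node draws a feature uniformly at random, splits its box at the midpoint along that feature, and each leaf predicts the average response of the observations it contains. Covariates are assumed to lie in $[0,1]^d$, and all names (`build_tree`, `forest_predict`, `depth`, `n_trees`) are hypothetical.

```python
import numpy as np

def build_tree(X, y, lo, hi, depth, rng):
    """Grow one midpoint tree on the box [lo, hi] (illustrative sketch,
    not the paper's code). Splits never look at y: the scheme is
    nonadaptive."""
    if depth == 0 or len(y) <= 1:
        # Leaf: predict the average response in the box (0.0 if empty).
        return ("leaf", float(y.mean()) if len(y) else 0.0)
    j = int(rng.integers(len(lo)))   # feature chosen uniformly at random
    mid = 0.5 * (lo[j] + hi[j])      # split at the midpoint of the box side
    left_mask = X[:, j] <= mid
    hi_left, lo_right = hi.copy(), lo.copy()
    hi_left[j] = lo_right[j] = mid
    left = build_tree(X[left_mask], y[left_mask], lo, hi_left, depth - 1, rng)
    right = build_tree(X[~left_mask], y[~left_mask], lo_right, hi, depth - 1, rng)
    return ("split", j, mid, left, right)

def predict_tree(node, x):
    """Route a single point x down the tree to its leaf prediction."""
    while node[0] == "split":
        _, j, mid, left, right = node
        node = left if x[j] <= mid else right
    return node[1]

def forest_predict(X, y, X_new, n_trees=100, depth=8, seed=0):
    """Average the predictions of independently randomized midpoint trees."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    trees = [build_tree(X, y, np.zeros(d), np.ones(d), depth, rng)
             for _ in range(n_trees)]
    return np.array([np.mean([predict_tree(t, x) for t in trees])
                     for x in X_new])
```

Because the split protocol never inspects the responses, this is a nonadaptive scheme in the sense of the [Lin and Jeon, 2006] lower bound discussed above.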