Complete Analysis of a Random Forest Model
Random forests have become an important tool for improving accuracy in regression problems since their popularization by [Breiman, 2001] and others. In this paper, we revisit a random forest model originally proposed by [Breiman, 2004] and later studied by [Biau, 2012], in which a feature is selected uniformly at random and the split occurs at the midpoint of the cell along the chosen feature. If the Lipschitz regression function is sparse, depending only on a small, unknown subset of $ S $ out of $ d $ features, we show that, given $ n $ observations, this random forest model outputs a predictor whose mean-squared prediction error decays at a rate governed by $ S $ rather than $ d $. When $ S \ll d $, this rate is significantly better than the minimax optimal rate for Lipschitz function classes in $ d $ dimensions. The second part of this article shows that the prediction error of this random forest model cannot in general be improved. As a striking consequence of our analysis, we show that if $ m $ (resp. $ M $) denotes the average (resp. maximum) number of observations per leaf node, then the variance of this forest admits an explicit bound in terms of $ m $ and $ M $. When $ m $ and $ M $ are of the same order, this variance bound is similar in form to the best-case variance lower bound of [Lin and Jeon, 2006] for any random forest model with a nonadaptive splitting scheme (i.e., one whose split protocol is independent of the data). We also show that the bias bound is tight for any linear model with a nonzero parameter vector. Finally, a side consequence of our analysis is that if the regression function is merely square-integrable (in particular, it need not be continuous or bounded), then the random forest predictor is pointwise consistent almost everywhere.
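The splitting scheme described above is data-independent: each internal node picks a feature uniformly at random and splits its cell at the midpoint along that feature, and only the leaf averages depend on the observations. A minimal sketch of such a centered forest, assuming features lie in $ [0,1]^d $ (the function and parameter names, tree depth, and number of trees are our own illustrative choices, not the paper's):

```python
import random

def build_centered_tree(bounds, depth):
    """Recursively build a 'centered' tree: at each node, pick a feature
    uniformly at random and split the cell at its midpoint, independently
    of the training data."""
    if depth == 0:
        return {"leaf": True, "bounds": bounds}
    j = random.randrange(len(bounds))
    lo, hi = bounds[j]
    mid = (lo + hi) / 2.0
    left = [b if i != j else (lo, mid) for i, b in enumerate(bounds)]
    right = [b if i != j else (mid, hi) for i, b in enumerate(bounds)]
    return {"leaf": False, "feature": j, "threshold": mid,
            "left": build_centered_tree(left, depth - 1),
            "right": build_centered_tree(right, depth - 1)}

def tree_predict(node, X, y, x):
    """Descend to the leaf cell containing x, then average the responses
    of the training points falling in that cell."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    cell = node["bounds"]
    in_leaf = [yi for xi, yi in zip(X, y)
               if all(lo <= v <= hi for v, (lo, hi) in zip(xi, cell))]
    return sum(in_leaf) / len(in_leaf) if in_leaf else 0.0

def forest_predict(X, y, x, d, n_trees=50, depth=3, seed=0):
    """Average the predictions of independently grown centered trees."""
    random.seed(seed)
    bounds = [(0.0, 1.0)] * d
    trees = [build_centered_tree(bounds, depth) for _ in range(n_trees)]
    return sum(tree_predict(t, X, y, x) for t in trees) / n_trees
```

Because the partition is chosen before seeing the data, the randomness of the trees enters only through which dyadic cell contains the query point, which is what makes the bias-variance analysis of this model tractable.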