Complete Analysis of a Random Forest Model

Abstract

Random forests have become an important tool for improving accuracy in regression problems since their popularization by Breiman (2001) and others. In this paper, we revisit a random forest model, originally proposed by Breiman (2004) and later studied by Biau (2012), in which a feature is selected uniformly at random and the split occurs at the midpoint of the box containing the chosen feature. If the Lipschitz regression function is sparse and depends only on a small, unknown subset of $S$ out of $d$ features, we show that, given $n$ observations, this random forest model outputs a predictor whose mean-squared prediction error is of order $\left(n(\sqrt{\log n})^{S-1}\right)^{-\frac{1}{S\log 2+1}}$. When $S \leq \lfloor 0.72 d \rfloor$, this rate is better than the minimax optimal rate $n^{-\frac{2}{d+2}}$ for $d$-dimensional Lipschitz function classes. The second part of this article shows that the prediction error of this random forest model cannot in general be improved. As a striking consequence of our analysis, we show that if $\ell_{avg}$ (resp. $\ell_{max}$) is the average (resp. maximum) number of observations per leaf node, then the variance of this forest is $\Theta\big(\ell^{-1}_{avg}(\sqrt{\log n})^{-(S-1)}\big)$, which for the case $S = d$ is similar in form to the lower bound $\Omega\big(\ell^{-1}_{max}(\log n)^{-(d-1)}\big)$ of Lin and Jeon (2006) for any random forest model with a nonadaptive splitting scheme. We also show that the bias is tight for any linear model with a nonzero parameter vector. Our new analysis also implies that better theoretical performance can be achieved if the trees are grown to a shallower depth than previous work would otherwise recommend.
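The splitting scheme is simple enough to sketch in a few lines. Below is a minimal illustration of a midpoint-split forest, assuming data in $[0,1]^d$; the function names, the fixed-depth stopping rule, and the use of the full sample in every tree are expository choices of ours, not details taken from the paper.

```python
import numpy as np

def build_tree(X, y, box_lo, box_hi, depth, rng):
    """Grow one tree: pick a feature uniformly at random and split the
    current box at its midpoint along that feature."""
    if depth == 0 or len(y) <= 1:
        # Leaf: predict the mean response of the points in this box.
        return float(np.mean(y)) if len(y) else 0.0
    j = rng.integers(X.shape[1])           # feature chosen at random
    mid = 0.5 * (box_lo[j] + box_hi[j])    # midpoint of the box side
    left = X[:, j] <= mid
    lo_r, hi_l = box_lo.copy(), box_hi.copy()
    hi_l[j], lo_r[j] = mid, mid
    return (j, mid,
            build_tree(X[left], y[left], box_lo, hi_l, depth - 1, rng),
            build_tree(X[~left], y[~left], lo_r, box_hi, depth - 1, rng))

def predict_tree(node, x):
    while isinstance(node, tuple):         # internal nodes are tuples
        j, mid, l, r = node
        node = l if x[j] <= mid else r
    return node                            # leaves are floats

def forest_predict(X, y, x, n_trees=100, depth=8, seed=0):
    """Average independently randomized trees grown on the full sample."""
    rng = np.random.default_rng(seed)
    lo, hi = np.zeros(X.shape[1]), np.ones(X.shape[1])  # data in [0,1]^d
    trees = [build_tree(X, y, lo, hi, depth, rng) for _ in range(n_trees)]
    return np.mean([predict_tree(t, x) for t in trees])
```

Because the splits depend only on the boxes and the random feature choices, never on the responses, the scheme is nonadaptive in the sense of Lin and Jeon (2006).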
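The constant $0.72$ can be recovered by comparing exponents: dropping the $(\sqrt{\log n})^{S-1}$ factor (which only improves the forest's rate), $n^{-\frac{1}{S\log 2+1}}$ decays at least as fast as $n^{-\frac{2}{d+2}}$ precisely when $S \leq \frac{d}{2\log 2} \approx 0.7213\, d$. A short numerical sanity check, with variable names of our choosing:

```python
import math

# Exponents of the two rates, with the (sqrt(log n))^{S-1} factor dropped;
# dropping it is conservative, since it only speeds up the forest's rate.
forest_exp = lambda S: 1.0 / (S * math.log(2) + 1.0)   # n^{-1/(S log 2 + 1)}
minimax_exp = lambda d: 2.0 / (d + 2.0)                # n^{-2/(d + 2)}

for d in (5, 10, 50, 100):
    S = math.floor(0.72 * d)   # sparsity threshold from the abstract
    assert forest_exp(S) >= minimax_exp(d)
    print(d, S, round(forest_exp(S), 4), round(minimax_exp(d), 4))
```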
