Data-generating models under which the random forest algorithm performs badly

Computational Statistics (Comput. Stat.), 2019

2 October 2019

Abstract

Examples are given of data-generating models under which some versions of the random forest algorithm may fail to be consistent or at least may be extremely slow to converge to the optimal predictor. Evidence provided for these properties is based on partly intuitive and partly rigorous arguments and on numerical experiments. Although one can always choose a model under which random forests perform very badly, in each case simple methods based on statistics of "variable use" and "variable importance" can be used to construct a better predictor based on a sort of mixture of random forests.
