
Node harvest: simple and interpretable regression and classification

Abstract

When choosing a suitable technique for regression and classification with multivariate predictor variables, one is often faced with a tradeoff between interpretability and high predictive accuracy. To give a classical example, classification and regression trees are easy to understand and interpret. Tree ensembles like Random Forests, on the other hand, usually provide more accurate predictions. Yet tree ensembles are also more difficult to analyze than single trees and are, perhaps unfairly, often criticized as `black box' predictors. `Node harvest' tries to reconcile the two aims of interpretability and predictive accuracy by combining positive aspects of trees and tree ensembles. The procedure is very simple: an initial set of a few thousand nodes is generated randomly. If a new observation falls into just a single node, its prediction is the mean response of all training observations within this node, identical to a tree-like prediction. A new observation typically falls into several nodes, however, and its prediction is then the weighted average of the mean responses across all these nodes. The only role of `node harvest' is to `pick' suitable nodes from an initial large ensemble of nodes. Each node receives a non-negative weight, and the allocation of weights amounts, in the proposed algorithm, to a quadratic programming problem with linear inequality constraints. The solution is sparse in the sense that only very few nodes are selected with a non-zero weight. It is not necessary to select a tuning parameter for optimal predictive accuracy. `Node harvest' handles mixed data and missing values well and is shown to be simple to interpret and competitive in predictive accuracy on a variety of datasets, with special attention given to an application in climate modelling.
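The prediction rule described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: each node is represented as a hyper-rectangle in feature space with a mean response, and the non-negative weights (which in node harvest come from the quadratic program) are fixed by hand here for demonstration.

```python
import numpy as np

def in_node(x, lower, upper):
    """Check whether observation x falls inside the node's rectangle."""
    return np.all((x >= lower) & (x <= upper))

def node_harvest_predict(x, nodes):
    """Weighted average of node means over all nodes containing x.

    Each node is a tuple (lower, upper, mean_response, weight); in the
    actual node harvest procedure the weights are the solution of a
    quadratic program, here they are assumed given.
    """
    weights, means = [], []
    for lower, upper, mean, w in nodes:
        if in_node(x, lower, upper):
            weights.append(w)
            means.append(mean)
    weights = np.asarray(weights)
    means = np.asarray(means)
    return np.sum(weights * means) / np.sum(weights)

# Two overlapping nodes in a 2-d feature space (hypothetical example).
nodes = [
    (np.array([0.0, 0.0]), np.array([1.0, 1.0]), 2.0, 0.5),
    (np.array([0.5, 0.0]), np.array([2.0, 1.0]), 4.0, 1.0),
]

# A point inside only the first node gets that node's mean (2.0),
# like a single tree; a point inside both gets the weighted average.
print(node_harvest_predict(np.array([0.25, 0.5]), nodes))  # -> 2.0
print(node_harvest_predict(np.array([0.75, 0.5]), nodes))  # -> (0.5*2 + 1.0*4) / 1.5
```

When a point falls into only one node, the weighted average collapses to the node's mean response, which is exactly the tree-like special case mentioned above.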
