On the Use of Harrell's C for Node Splitting in Random Survival Forests

Random forests are among the most successful methods for statistical learning and prediction. Here we consider random survival forests (RSF), which are an extension of the original random forest method to right-censored outcome variables. RSF use the log-rank split criterion to form an ensemble of survival trees; the prediction accuracy of the ensemble estimate is subsequently evaluated by the concordance index for survival data ("Harrell's C"). Conceptually, this strategy means that the split criterion in RSF differs from the evaluation criterion of interest. In view of this discrepancy, we analyze the theoretical relationship between the two criteria and investigate whether a unified strategy that uses Harrell's C for both node splitting and evaluation is able to improve the performance of RSF. Based on simulation studies and the analysis of real-world data, we show that substantial performance gains are possible if the log-rank statistic is replaced by Harrell's C for node splitting in RSF. Our results also show that C-based splitting is not superior to log-rank splitting if the percentage of noise variables is high, a result which can be attributed to the more unbalanced splits that are generated by the log-rank statistic.
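
For readers unfamiliar with the concordance index, the following is a minimal Python sketch of how Harrell's C can be computed for right-censored data, together with one plausible way of scoring a candidate split with it. The function names `harrell_c` and `split_score`, the handling of tied times (skipped here), and the use of |C - 0.5| as a split score are illustrative assumptions; the abstract does not specify these details.

```python
from itertools import combinations


def harrell_c(times, events, risk_scores):
    """Harrell's C for right-censored data (illustrative sketch).

    times:       observed times (event or censoring time)
    events:      1 if the event was observed, 0 if censored
    risk_scores: higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: pairs with tied times are skipped
        # Let `a` be the observation with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk failed earlier: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def split_score(times, events, x, threshold):
    """Hypothetical C-based score for the candidate split x <= threshold.

    Membership in the left child is used as a binary risk score, and the
    distance of C from 0.5 is returned so that either direction of
    separation counts as a good split.
    """
    left = [1.0 if value <= threshold else 0.0 for value in x]
    return abs(harrell_c(times, events, left) - 0.5)
```

In a tree-growing loop, a score of this kind would be evaluated for each candidate variable and threshold in a node, and the split with the largest value would be chosen in place of the one with the largest log-rank statistic.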