Feature selection in "omics" prediction problems using cat scores and false non-discovery rate control

11 March 2009

Abstract

We revisit the problem of feature selection in linear discriminant analysis (LDA), i.e. when features are correlated. First, we introduce a pooled centroids formulation of the multi-class LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false non-discovery rates (FNDR). We show that, contrary to previous claims, this FNDR procedure performs very well and similarly to "higher criticism". Third, training of the classifier function is conducted by plug-in of James-Stein shrinkage estimates of correlations and variances, using analytic procedures for choosing regularization parameters. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package "sda", available from the R repository CRAN.
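
To make the idea of a correlation-adjusted t-score concrete, the following is a minimal sketch for the special case of two features. It assumes that the cat score vector is the t-score vector premultiplied by the inverse square root of the feature correlation matrix, P^(-1/2) t; the abstract does not spell out this formula, so treat it as an assumption. The function name `cat_scores_2d` and the example numbers are hypothetical, and the sketch uses a known correlation rather than the shrinkage estimate that the "sda" package would compute.

```python
import math

def cat_scores_2d(t, r):
    """Decorrelate two t-scores given the correlation r between the features.

    Assumption: cat = P^(-1/2) t with P = [[1, r], [r, 1]] and |r| < 1.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    # P has eigenvalues 1 + r and 1 - r with eigenvectors (1, 1) and (1, -1),
    # which gives P^(-1/2) in closed form as [[a, b], [b, a]].
    s_plus = 1.0 / math.sqrt(1.0 + r)
    s_minus = 1.0 / math.sqrt(1.0 - r)
    a = 0.5 * (s_plus + s_minus)  # diagonal entry of P^(-1/2)
    b = 0.5 * (s_plus - s_minus)  # off-diagonal entry of P^(-1/2)
    return (a * t[0] + b * t[1], b * t[0] + a * t[1])

# Hypothetical t-scores for two features.
t = (3.0, 2.5)
print(cat_scores_2d(t, 0.0))  # no correlation: → (3.0, 2.5), the t-scores themselves
print(cat_scores_2d(t, 0.8))  # strong correlation changes both scores
```

With zero correlation the cat scores reduce to the ordinary t-scores. In general, the squared cat scores sum to the quadratic form t' P^(-1) t, so each feature receives a share of the overall correlation-adjusted signal; ranking features by these scores is what the thresholding step would then operate on.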
