Supervised Dimensionality Reduction for Big Data

Abstract

To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). Existing linear and nonlinear dimensionality reduction methods either are not supervised, scale poorly to operate in big data regimes, lack theoretical guarantees, or are "black-box" methods unsuitable for many applications. We introduce "Linear Optimal Low-rank" projection (LOL), which extends principal components analysis by incorporating, rather than ignoring, class labels, and facilitates straightforward generalizations. We prove, and substantiate with both synthetic and real data benchmarks, that LOL leads to an improved data representation for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of >150 million features, and several genomics datasets with >500,000 features, LOL achieves state-of-the-art classification accuracy, while requiring only a few minutes on a standard desktop computer.
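
The abstract describes LOL as extending PCA by using class labels when constructing the low-dimensional projection. The following is a minimal sketch of that idea for a two-class problem, combining the class-mean-difference direction with the top principal directions of class-centered data; the function name lol_project and all implementation details are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def lol_project(X, y, d):
    """Sketch of a LOL-style supervised projection (assumes two classes, 0 and 1).

    X : (n, p) data matrix; y : (n,) binary labels; d : target dimension.
    Returns a (p, d) projection matrix built from the normalized class-mean
    difference plus the top d-1 principal directions of class-centered data.
    """
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    delta = mu1 - mu0
    delta /= np.linalg.norm(delta)           # supervised direction from class labels

    # Center each sample by its own class mean, then take principal directions.
    Xc = X.copy()
    Xc[y == 0] -= mu0
    Xc[y == 1] -= mu1
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Stack the mean-difference direction with the top d-1 principal components.
    return np.column_stack([delta, Vt[: d - 1].T])

# Usage: Z = X @ lol_project(X, y, d=10); then train any classifier on Z.
```

In this sketch, the supervised term distinguishes LOL from plain PCA, which would ignore y entirely; the SVD step keeps the computation scalable since only the top few singular vectors are needed.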
