A Fast Algorithm for Clustering High Dimensional Feature Vectors

Abstract

We propose an algorithm for clustering high dimensional data. If $P$ features for $N$ objects are represented in an $N \times P$ matrix $\mathbf{X}$, where $N \ll P$, the method is based on exploiting the cluster-dependent structure of the $N \times N$ matrix $\mathbf{XX}^T$. The computational burden thus depends primarily on $N$, the number of objects to be clustered, rather than $P$, the number of features that are measured. This makes the method particularly useful in high dimensional settings, where it is substantially faster than a number of other popular clustering algorithms. Aside from an upper bound on the number of potential clusters, the method is independent of tuning parameters. In a comparison with 16 other clustering algorithms on 32 genomic datasets with gold standards, we show that it provides the most accurate cluster configuration more than twice as often as its closest competitors. We illustrate the method on data taken from highly cited genomic studies.
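
To make the role of $\mathbf{XX}^T$ concrete, the following is a minimal Python sketch of why working with the $N \times N$ matrix is enough when $N \ll P$. It is not the algorithm proposed in the paper; it only illustrates two standard facts: $\mathbf{XX}^T$ has size $N \times N$ regardless of $P$, and all pairwise squared Euclidean distances between objects can be recovered from it via $d_{ij}^2 = G_{ii} + G_{jj} - 2G_{ij}$. The function names, the toy data, and the choice of two groups with shifted means are assumptions made for illustration.

```python
import numpy as np


def gram_matrix(X):
    """Return the N x N matrix X X^T for an N x P data matrix X."""
    X = np.asarray(X, dtype=float)
    return X @ X.T


def squared_distances_from_gram(G):
    """Recover pairwise squared Euclidean distances from G = X X^T.

    Uses d_ij^2 = G_ii + G_jj - 2 G_ij, so P is never touched again
    once G has been formed.
    """
    diag = np.diag(G)
    D2 = diag[:, None] + diag[None, :] - 2.0 * G
    # Clip tiny negative values that can arise from floating-point round-off.
    return np.maximum(D2, 0.0)


# Hypothetical toy data: N = 6 objects, P = 5000 features, N << P.
rng = np.random.default_rng(0)
N, P = 6, 5000
X = rng.normal(size=(N, P))
X[:3] += 1.0  # assumed structure: the first three objects share a shifted mean

G = gram_matrix(X)                    # N x N, independent of P after this step
D2 = squared_distances_from_gram(G)   # N x N squared distances

print(G.shape, D2.shape)
```

In this toy setting the entries of `G` within the shifted group are visibly larger than the entries between groups, which is one simple form of cluster-dependent structure. Any later step that operates only on `G` has a cost governed by $N$ rather than $P$.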
