A Probabilistic Method for Clustering High Dimensional Data
In general, the clustering problem is NP-hard, and global optimality cannot be established for non-trivial instances. For high-dimensional data, distance-based methods for clustering or classification face an additional difficulty: the unreliability of distances in very high-dimensional spaces. We propose a distance-based iterative method for clustering data in very high-dimensional space, using the $\ell_1$-metric, which is less sensitive to high dimensionality than the Euclidean distance. For $K$ clusters in $\mathbb{R}^n$, the problem decomposes to $K$ problems coupled by probabilities, and an iteration reduces to finding weighted medians of points on a line. The complexity of the algorithm is linear in the dimension of the data space, and its performance was observed to improve significantly as the dimension increases.
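The reduction to weighted medians can be illustrated with a minimal sketch. Under the $\ell_1$ distance, a weighted sum of distances to a center separates by coordinate, so each coordinate of the center is a weighted median of the data's values along that axis. The function names `weighted_median` and `l1_center` are hypothetical, and the weights are a stand-in for whatever probability-derived weights the method uses, which the abstract does not specify; this is not the authors' implementation.

```python
def weighted_median(points, weights):
    """Return a minimizer m of sum(w_i * |x_i - m|) for points on a line.

    Assumes nonnegative weights with a positive total (an assumption of
    this sketch, not a statement about the paper's weights).
    """
    if not points or len(points) != len(weights):
        raise ValueError("points and weights must be non-empty and equal in length")
    pairs = sorted(zip(points, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for x, w in pairs:
        running += w
        # The first point at which the cumulative weight reaches half
        # of the total weight is a weighted median.
        if running >= half:
            return x
    return pairs[-1][0]


def l1_center(data, weights):
    """Coordinate-wise weighted median of the rows of `data`.

    Minimizes sum_i w_i * ||x_i - c||_1 over c, one coordinate at a time,
    so the work grows linearly with the number of coordinates.
    """
    n = len(data[0])
    return [weighted_median([row[j] for row in data], weights) for j in range(n)]
```

For example, `l1_center([[0, 5], [1, 6], [10, 7]], [1, 1, 1])` picks the median of each column independently, so the outlying value 10 in the first coordinate does not pull the center toward it.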