Towards O(1) Seeding of K-Means

Abstract

K-means is one of the most widely used clustering algorithms in data mining applications; it attempts to minimize the sum of squared Euclidean distances of the points in each cluster from that cluster's mean. The simplicity and scalability of K-means make it very appealing. However, K-means suffers from the local minima problem and comes with no guarantee of converging to the optimal cost. K-means++ addresses this problem by seeding the means using a distance-based sampling scheme. However, seeding the means in K-means++ requires O(K) passes through the entire dataset, which can be very costly for large datasets. Here we propose a method for seeding the initial means based on factorizations of higher-order moments for bounded data. Our method takes O(1) passes through the entire dataset to extract the initial set of means, and its final cost can be proven to be within O(√K) of the optimal cost. We demonstrate the performance of our algorithm in comparison with existing algorithms on various benchmark datasets.
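To illustrate why K-means++ seeding costs O(K) passes, here is a minimal sketch of its D²-sampling scheme (Arthur and Vassilvitskii's k-means++): each of the K rounds scans the full dataset once to update every point's squared distance to its nearest chosen center. This is an illustrative NumPy implementation, not the moment-based method proposed in the paper; the function name and toy data are our own.

```python
import numpy as np

def kmeans_pp_seed(X, k, seed=None):
    """Pick k initial means by D^2 sampling (k-means++ seeding).

    Each of the k rounds makes one full pass over X to refresh the
    nearest-center distances, hence O(K) passes overall.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # First center: chosen uniformly at random.
    centers = [X[rng.integers(n)]]
    # Squared distance of every point to its nearest chosen center.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        # Sample the next center with probability proportional to D^2
        # (one pass over the data), then update the distances.
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)

# Toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(5.0, 0.1, (50, 2))])
C = kmeans_pp_seed(X, k=2, seed=0)
```

Because D² sampling heavily favors points far from the already chosen centers, on data like this the two seeds almost surely land in different blobs, which is the intuition behind K-means++'s approximation guarantee.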
