$k$-means is a popular clustering objective that is NP-hard in the worst case but is often solved efficiently by simple heuristics in practice. The implicit assumption behind using $k$-means (or many other objectives) is that an optimal solution would recover the underlying ground-truth clustering. In most real-world datasets, the underlying ground-truth clustering is unambiguous and stable under small perturbations of the data. As a consequence, the ground-truth clustering satisfies center proximity, that is, every point is closer to the center of its own cluster than to the center of any other cluster, by some multiplicative factor $\alpha > 1$. We study the problem of minimizing the Euclidean $k$-means objective only over clusterings that satisfy $\alpha$-center proximity. We give a simple algorithm to find an exact optimal clustering for the above objective with running time exponential in $k$ and $1/(\alpha - 1)$ but linear in the number of points and the dimension. We define an analogous $\alpha$-center proximity condition for outliers, and give similar algorithmic guarantees for $k$-means with outliers and $\alpha$-center proximity. On the hardness side, we show that for any $\alpha' > 1$, there exist an $\alpha \leq \alpha'$ (with $\alpha > 1$) and an $\varepsilon_0 > 0$ such that minimizing the $k$-means objective over clusterings that satisfy $\alpha$-center proximity is NP-hard to approximate within a multiplicative $(1 + \varepsilon_0)$ factor.
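To make the center proximity condition concrete, the following minimal sketch checks whether a given clustering satisfies it for a chosen factor `alpha`, and computes the Euclidean $k$-means cost of that clustering. It is an illustration under stated assumptions, not the algorithm from the paper: the function names are hypothetical, each cluster's center is assumed to be its mean, every cluster is assumed to be non-empty, and the inequality is assumed to be strict.

```python
import math


def centroid(points):
    """Mean of a non-empty list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def satisfies_center_proximity(clusters, alpha):
    """Return True if, for every point x in cluster i and every other
    cluster j, dist(x, center_j) > alpha * dist(x, center_i).

    Assumption: centers are the cluster means and the inequality is strict.
    """
    centers = [centroid(c) for c in clusters]
    for i, cluster in enumerate(clusters):
        for x in cluster:
            own = math.dist(x, centers[i])
            for j, other_center in enumerate(centers):
                if j != i and math.dist(x, other_center) <= alpha * own:
                    return False
    return True


def kmeans_cost(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum(math.dist(x, center) ** 2 for x in cluster)
    return total


# Two well-separated clusters on a line (illustrative data only).
example = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 0.0), (11.0, 0.0)]]
print(satisfies_center_proximity(example, alpha=2.0))
print(satisfies_center_proximity(example, alpha=25.0))
print(kmeans_cost(example))
```

In this toy instance every point lies 0.5 from its own center and at least 9.5 from the other, so the condition holds for small `alpha` and fails once `alpha` exceeds the separation ratio. The check only verifies a given clustering; finding the cheapest clustering that passes it is the optimization problem studied here.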