Sets Clustering

The input to the \emph{sets-$k$-means} problem is an integer $k \geq 1$ and a set $\mathcal{P} = \{P_1, \ldots, P_n\}$ of sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P \in \mathcal{P}} \min_{p \in P, c \in C} \|p - c\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a $1 \pm \varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2 n)$ sets always exists, and can be computed in $O(n \log n)$ time, for every input $\mathcal{P}$ and every fixed $d, k \geq 1$ and $\varepsilon \in (0,1)$. The result easily generalizes to any metric space, distances to the power of $z > 0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm to this core-set allows us to obtain the first PTAS ($(1+\varepsilon)$-approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k = 1$, $d = 2$). Open source code and experimental results for document classification and facility locations are also provided.
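To make the objective concrete, the following is a minimal sketch of the cost that the problem minimizes. It assumes that the distance from an input set to the centers is the smallest squared Euclidean distance over all point-center pairs. The function names, the optional `weights` argument (standing in for core-set weights), and the toy data are hypothetical and are not taken from the authors' released code.

```python
from itertools import product


def sq_dist(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def sets_kmeans_cost(sets, centers, weights=None):
    """Weighted sum, over the input sets, of the smallest squared distance
    between any point of the set and any center (assumed objective)."""
    if weights is None:
        weights = [1.0] * len(sets)  # unweighted input: every set counts once
    return sum(
        w * min(sq_dist(p, c) for p, c in product(P, centers))
        for P, w in zip(sets, weights)
    )


# Hypothetical toy input: three sets of points in the plane and two centers.
sets = [
    [(0.0, 0.0), (5.0, 5.0)],
    [(1.0, 0.0), (9.0, 9.0)],
    [(4.0, 4.0)],
]
centers = [(0.0, 0.0), (4.0, 5.0)]
cost = sets_kmeans_cost(sets, centers)
```

A core-set in this sense would be a short list of input sets with weights such that evaluating the same function on it gives nearly the same value as on the full input, for every choice of `centers`. This brute-force evaluation only illustrates the objective; it is not the core-set construction itself.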