Sets Clustering

Abstract

The input to the \emph{sets-$k$-means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\| p-c \right\|^2$ of squared distances to these sets. An \emph{$\varepsilon$-core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$. The result generalizes easily to any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm to this core-set allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.
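
To make the objective concrete, the following minimal Python sketch evaluates the sets-$k$-means cost for a given set of centers, with optional per-set weights as a core-set would carry. It is an illustration of the definition only, not the authors' released code or their core-set construction; the names `sq_dist`, `sets_kmeans_cost`, and `within_eps` are hypothetical, and the toy data is made up for the example.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def sets_kmeans_cost(
    sets: Sequence[Sequence[Point]],
    centers: Sequence[Point],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum over input sets P of w_P * min_{p in P, c in C} ||p - c||^2.

    With weights=None every set has weight 1, which is the original
    (unweighted) objective. A core-set would pass a subset of the
    input sets together with its weights.
    """
    if weights is None:
        weights = [1.0] * len(sets)
    return sum(
        w * min(sq_dist(p, c) for p in P for c in centers)
        for w, P in zip(weights, sets)
    )


def within_eps(full_cost: float, coreset_cost: float, eps: float) -> bool:
    """Check the 1 +/- eps guarantee for ONE fixed set of centers.

    The core-set definition requires this to hold for every set of
    k centers; this helper only tests a single candidate.
    """
    return abs(coreset_cost - full_cost) <= eps * full_cost


# Toy input (made up): n = 3 sets in the plane, k = 2 centers.
example_sets = [[(0, 0), (10, 10)], [(1, 0)], [(5, 5), (6, 5)]]
example_centers = [(0, 1), (5, 4)]
example_cost = sets_kmeans_cost(example_sets, example_centers)  # 1 + 2 + 1 = 4
```

Each set contributes only its closest point-center pair, so a set is served as soon as any one of its points lies near any center. This is what distinguishes the problem from ordinary $k$-means on the union of all points.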
