Sets Clustering

9 March 2020

Abstract

The input to the \emph{sets- $k$ -means} problem is an integer $k\geq 1$ and a set $\mathcal{P}=\{P_1,\cdots,P_n\}$ of sets in $\mathbb{R}^d$ . The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in \mathcal{P}} \min_{p\in P, c\in C}\left\| p-c \right\|^2$ of squared distances to these sets. An \emph{ $\varepsilon$ -core-set} for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to $1\pm\varepsilon$ factor, for \emph{every} set $C$ of $k$ centers in $\mathbb{R}^d$ . We prove that such a core-set of $O(\log^2{n})$ sets always exists, and can be computed in $O(n\log{n})$ time, for every input $\mathcal{P}$ and every fixed $d,k\geq 1$ and $\varepsilon \in (0,1)$ . The result easily generalized for any metric space, distances to the power of $z>0$ , and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ( $1+\varepsilon$ approximation) for the sets- $k$ -means problem that takes time near linear in $n$ . This is the first result even for sets-mean on the plane ( $k=1$ , $d=2$ ). Open source code and experimental results for document classification and facility locations are also provided.

View on arXiv

Comments on this paper