A Unified Framework for Clustering Constrained Data without Locality Property

2 October 2018

Abstract

In this paper, we consider a class of constrained clustering problems of points in $\mathbb{R}^{d}$, where $d$ could be rather high. A common feature of these problems is that their optimal clusterings no longer have the locality property (due to the additional constraints), which is a key property required by many algorithms for their unconstrained counterparts. To overcome the difficulty caused by the loss of locality, we present in this paper a unified framework, called Peeling-and-Enclosing (PnE), to iteratively solve two variants of the constrained clustering problems, constrained $k$-means clustering ($k$-CMeans) and constrained $k$-median clustering ($k$-CMedian). Our framework is based on two standalone geometric techniques, called Simplex Lemma and Weaker Simplex Lemma, for $k$-CMeans and $k$-CMedian, respectively. The simplex lemma (or weaker simplex lemma) enables us to efficiently approximate the mean (or median) point of an unknown set of points by searching a small-size grid, independent of the dimensionality of the space, in a simplex (or the surrounding region of a simplex), and thus can be used to handle high dimensional data. If $k$ and $\frac{1}{\epsilon}$ are fixed numbers, our framework generates, in nearly linear time (i.e., $O(n(\log n)^{k+1}d)$), $O((\log n)^{k})$ $k$-tuple candidates for the $k$ mean or median points, and one of them induces a $(1+\epsilon)$-approximation for $k$-CMeans or $k$-CMedian, where $n$ is the number of points. Combining this unified framework with a problem-specific selection algorithm (which determines the best $k$-tuple candidate), we obtain a $(1+\epsilon)$-approximation for each of the constrained clustering problems. We expect that our technique will be applicable to other constrained clustering problems without locality.
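
The geometric idea behind searching a grid inside a simplex can be illustrated with a small sketch. It relies only on an elementary fact: if a point set is split into $j$ subsets, its mean is the size-weighted average of the subset means, so it lies in the simplex spanned by them. The grid below is a plain uniform grid over barycentric coordinates, chosen for illustration; the abstract does not specify the paper's actual grid construction or its error guarantee, and the names `barycentric_grid`, `steps`, and the sample data are hypothetical.

```python
import itertools
import numpy as np

def barycentric_grid(vertices, steps):
    """Yield every point sum_l (c_l / steps) * vertices[l], where the c_l
    are nonnegative integers summing to `steps` (illustrative grid only)."""
    j = len(vertices)
    for combo in itertools.product(range(steps + 1), repeat=j - 1):
        if sum(combo) <= steps:
            weights = np.array(combo + (steps - sum(combo),), dtype=float) / steps
            yield weights @ vertices

# Assumed toy data: three subsets of an "unknown" set in a 50-dimensional space.
rng = np.random.default_rng(0)
d = 50
subsets = [rng.normal(loc=c, size=(m, d))
           for c, m in [(0.0, 31), (3.0, 49), (-2.0, 20)]]

# Simplex vertices are the subset means; the target is the mean of the union.
vertices = np.stack([s.mean(axis=0) for s in subsets])
true_mean = np.concatenate(subsets).mean(axis=0)

# The union's mean is the size-weighted convex combination of the subset means.
sizes = np.array([len(s) for s in subsets], dtype=float)
convex_weights = sizes / sizes.sum()

# Search the grid for the point closest to the true mean.
steps = 10
grid = list(barycentric_grid(vertices, steps))
best = min(grid, key=lambda p: np.linalg.norm(p - true_mean))
best_error = np.linalg.norm(best - true_mean)

# Largest distance between two simplex vertices, used as a scale for the error.
diameter = max(np.linalg.norm(u - v)
               for u, v in itertools.combinations(vertices, 2))
```

The number of grid points here depends only on the number of vertices and on `steps`, not on `d`, which mirrors the abstract's remark that the grid size is independent of the dimensionality. In the sketch the true mean is known so that the error can be measured; in the framework the set is unknown, which is why candidates are enumerated and a separate selection step picks among them.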
