A Unified Framework for Clustering Constrained Data without Locality Property

Abstract

In this paper, we consider a class of constrained clustering problems of points in $\mathbb{R}^{d}$, where $d$ could be rather high. A common feature of these problems is that their optimal clusterings no longer have the locality property (due to the additional constraints), which is a key property required by many algorithms for their unconstrained counterparts. To overcome the difficulty caused by the loss of locality, we present in this paper a unified framework, called {\em Peeling-and-Enclosing (PnE)}, to iteratively solve two variants of the constrained clustering problems, {\em constrained $k$-means clustering} ($k$-CMeans) and {\em constrained $k$-median clustering} ($k$-CMedian). Our framework is based on two standalone geometric techniques, called {\em Simplex Lemma} and {\em Weaker Simplex Lemma}, for $k$-CMeans and $k$-CMedian, respectively. The simplex lemma (or weaker simplex lemma) enables us to efficiently approximate the mean (or median) point of an unknown set of points by searching a small-size grid, independent of the dimensionality of the space, in a simplex (or the surrounding region of a simplex), and thus can be used to handle high dimensional data. If $k$ and $\frac{1}{\epsilon}$ are fixed numbers, our framework generates, in nearly linear time ({\em i.e.,} $O(n(\log n)^{k+1}d)$), $O((\log n)^{k})$ $k$-tuple candidates for the $k$ mean or median points, and one of them induces a $(1+\epsilon)$-approximation for $k$-CMeans or $k$-CMedian, where $n$ is the number of points. Combining this unified framework with a problem-specific selection algorithm (which determines the best $k$-tuple candidate), we obtain a $(1+\epsilon)$-approximation for each of the constrained clustering problems. We expect that our technique will be applicable to other constrained clustering problems without locality.
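
The abstract does not state the Simplex Lemma itself, so the following is only a toy illustration of the idea of searching a grid inside a simplex, not the paper's construction. It relies on one elementary fact: the mean of a point set split into $j$ parts is a size-weighted convex combination of the parts' means, and therefore lies in the simplex they span. The function names `mean` and `simplex_grid`, the parameter `steps`, and the synthetic data are all assumptions introduced for this sketch.

```python
import itertools
import math
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def simplex_grid(vertices, steps):
    """Yield convex combinations of `vertices` with weights in multiples of 1/steps.

    Hypothetical helper: the number of grid points is C(steps + j - 1, j - 1),
    which depends on j = len(vertices) and `steps`, not on the dimension.
    """
    j = len(vertices)
    dim = len(vertices[0])
    # Stars and bars: enumerate integer weights (c_1, ..., c_j) summing to `steps`.
    for bars in itertools.combinations(range(steps + j - 1), j - 1):
        prev = -1
        counts = []
        for b in bars + (steps + j - 1,):
            counts.append(b - prev - 1)
            prev = b
        yield tuple(
            sum(c * v[i] for c, v in zip(counts, vertices)) / steps
            for i in range(dim)
        )


# Synthetic data (an assumption for illustration): three parts in 50 dimensions.
random.seed(0)
dim = 50
parts = [
    [tuple(random.gauss(mu, 1.0) for _ in range(dim)) for _ in range(size)]
    for mu, size in [(0.0, 30), (5.0, 50), (-3.0, 20)]
]

vertices = [mean(p) for p in parts]            # means of the three parts
true_mean = mean([x for p in parts for x in p])  # mean of the whole set

# Search the grid for the point closest to the true mean.
grid = list(simplex_grid(vertices, steps=10))
best = min(grid, key=lambda g: math.dist(g, true_mean))
print(len(grid), math.dist(best, true_mean))
```

In this toy setup the part sizes (30, 50, 20 out of 100) happen to be multiples of 1/10, so the grid contains the true mean up to floating-point error. In general a grid point would only approximate the mean, and a finer grid (larger `steps`) would tighten that approximation at the cost of more candidates.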
