
Graph sketching-based Space-efficient Data Clustering

Abstract

In this paper, we address the problem of recovering arbitrary-shaped data clusters from datasets under tight space constraints, as is the case, for instance, in the Internet of Things, where analysis algorithms are deployed directly on resource-limited mobile devices that collect the data. We present DBMSTClu, a new density-based \emph{non-parametric} method working on a limited number of linear measurements, i.e. a \emph{sketched} version of the dissimilarity graph $G$ between the $N$ objects to cluster. Unlike the $k$-means, $k$-medians or $k$-medoids algorithms, it does not fail at distinguishing clusters with particular structures. Unlike DBSCAN or the Spectral Clustering method, it needs no input parameter. As a graph-based technique, DBMSTClu relies on the dissimilarity graph $G$, which theoretically costs $O(N^2)$ in memory. However, our algorithm follows the dynamic semi-streaming model: it handles $G$ as a stream of edge weight updates and sketches it in one pass over the data into a compact structure requiring $O(N \operatorname{polylog}(N))$ space. Because the Minimum Spanning Tree (MST) expresses the underlying structure of a graph, our algorithm successfully detects the right number of non-convex clusters by recovering an approximate MST from the graph sketch of $G$. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state of the art on several datasets.
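
To illustrate why an MST exposes cluster structure, the following is a minimal sketch and not the DBMSTClu procedure itself. It assumes Euclidean dissimilarities, builds an exact MST with Prim's algorithm instead of recovering an approximate one from a graph sketch, and replaces the paper's parameter-free criterion with a naive rule that removes a user-chosen number of heaviest edges. The names `mst_prim`, `cut_mst` and `n_cuts` are hypothetical.

```python
import math


def mst_prim(points):
    """Exact MST of the complete Euclidean dissimilarity graph (Prim, O(N^2) time).

    Assumption: dissimilarity = Euclidean distance, computed on the fly,
    so the O(N^2) edge list is never stored.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge weight linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []  # (weight, u, v) triples of the MST
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # Relax the attachment cost of the remaining nodes.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def cut_mst(n, edges, n_cuts):
    """Drop the n_cuts heaviest MST edges and label the connected components.

    Assumption: n_cuts is supplied by the caller; DBMSTClu instead decides
    the cuts without any input parameter.
    """
    kept = sorted(edges)[: len(edges) - n_cuts]
    root = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for _, u, v in kept:
        root[find(u)] = find(v)
    return [find(i) for i in range(n)]


# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = cut_mst(len(points), mst_prim(points), n_cuts=1)
```

Here the single long MST edge bridging the two groups is the one removed, so each group ends up with its own label. The example only conveys the intuition that inter-cluster edges stand out in the MST; it gives none of the space savings of the sketching approach.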
