arXiv:1703.02375

Graph sketching-based Massive Data Clustering

7 March 2017
Anne Morvan
K. Choromanski
Cédric Gouy-Pailler
Jamal Atif
Abstract

In this paper, we address the problem of recovering arbitrary-shaped data clusters from massive datasets. We present DBMSTClu, a new density-based non-parametric method that works on a limited number of linear measurements, i.e. a sketched version of the similarity graph $G$ between the $N$ objects to cluster. Unlike the $k$-means, $k$-medians or $k$-medoids algorithms, it does not fail at distinguishing clusters with particular structures. No input parameter is needed, in contrast to DBSCAN or the Spectral Clustering method. As a graph-based technique, DBMSTClu relies on the similarity graph $G$, which theoretically costs $O(N^2)$ in memory. However, our algorithm follows the dynamic semi-streaming model: it handles $G$ as a stream of edge weight updates and sketches it in one pass over the data into a compact structure requiring $O(\operatorname{poly} \operatorname{log}(N))$ space. Because the Minimum Spanning Tree (MST) expresses the underlying structure of a graph, our algorithm successfully detects the right number of non-convex clusters by recovering an approximate MST from the graph sketch of $G$. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state of the art on several datasets.
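To make the role of the MST concrete, the following is a minimal Python sketch of generic MST-based clustering: build a spanning tree over the points, remove its longest edges, and read the clusters off the remaining connected components. It is not DBMSTClu itself. The abstract does not spell out the cut criterion, so the `max_edge_length` threshold is a hypothetical stand-in (DBMSTClu is described as parameter-free), and the tree here is an exact MST computed from all pairwise Euclidean distances rather than an approximate one recovered from a graph sketch. All function names are illustrative.

```python
import math
from itertools import combinations


def _find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def minimum_spanning_tree(points):
    """Kruskal's algorithm on the complete Euclidean graph over `points`.

    Assumption: an exact MST from all O(N^2) pairwise distances, which is
    precisely the memory cost the paper avoids by sketching.
    """
    n = len(points)
    parent = list(range(n))
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    mst = []
    for w, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:  # edge joins two different components: keep it
            parent[ri] = rj
            mst.append((w, i, j))
    return mst


def cut_mst(n, mst, max_edge_length):
    """Drop MST edges longer than the (hypothetical) threshold and
    label the connected components that remain."""
    parent = list(range(n))
    for w, i, j in mst:
        if w <= max_edge_length:
            parent[_find(parent, i)] = _find(parent, j)
    roots = {}
    # Labels are assigned in order of first appearance: 0, 1, 2, ...
    return [roots.setdefault(_find(parent, i), len(roots)) for i in range(n)]


# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
mst = minimum_spanning_tree(points)
labels = cut_mst(len(points), mst, max_edge_length=2.0)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the MST keeps only the shortest connections needed to link all points, the single long edge bridging the two groups is the only one removed, and no number of clusters has to be supplied in advance.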
