arXiv:1704.06977

Overlapping Variable Clustering with Statistical Guarantees

23 April 2017
Xin Bing
F. Bunea
Y. Ning
Abstract

Variable clustering is one of the most important unsupervised learning methods, ubiquitous in most research areas. In the statistics and computer science literature, most clustering methods lead to non-overlapping partitions of the variables. However, in many applications some variables may belong to multiple groups, yielding clusters with overlap. It is still largely unknown how to perform overlapping variable clustering with statistical guarantees. To bridge this gap, we propose a novel Latent model-based OVErlapping clustering method (LOVE) to recover overlapping sub-groups of a potentially very large group of variables. In our model-based formulation, a cluster consists of the variables associated with the same latent factor, and can be determined from an allocation matrix A that indexes our proposed latent model. We assume that some of the observed variables are pure, in that they are associated with only one latent factor, whereas the remaining majority have multiple allocations. We prove that the corresponding allocation matrix A, and the induced overlapping clusters, are identifiable up to label switching. We estimate the clusters with LOVE, our newly developed algorithm, which consists of two steps. The first step estimates the set of pure variables and the number of clusters. The second step estimates the allocation matrix A and determines the overlapping clusters. Under minimal signal strength conditions, our algorithm recovers the population-level clusters consistently. Our theoretical results are fully supported by our empirical studies, which include extensive simulation studies that compare LOVE with other existing methods, and the analysis of an RNA-seq dataset.
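
To make the model-based definition of a cluster concrete, the following is a minimal sketch of how overlapping clusters and pure variables can be read off a known allocation matrix A. It is not the LOVE estimator, which has to estimate A from data. The linear form X = AZ + E used in `simulate`, the matrix entries, and all function names are assumptions made for illustration; the abstract does not spell them out.

```python
import random

# Hypothetical allocation matrix A for p = 6 variables and K = 2 latent factors.
# Row j holds the loadings of variable j on each factor (illustrative values only).
A = [
    [1.0, 0.0],   # pure variable: loads on factor 0 only
    [-1.0, 0.0],  # pure variable: loads on factor 0 only, negative sign
    [0.0, 1.0],   # pure variable: loads on factor 1 only
    [0.0, 1.0],   # pure variable: loads on factor 1 only
    [0.5, 0.5],   # mixed variable: allocated to both factors
    [0.3, -0.7],  # mixed variable: allocated to both factors
]


def pure_variables(A, tol=1e-12):
    """Return indices of rows of A with exactly one nonzero entry."""
    return [j for j, row in enumerate(A)
            if sum(abs(a) > tol for a in row) == 1]


def overlapping_clusters(A, tol=1e-12):
    """Cluster k is the set of variables with a nonzero loading on factor k."""
    num_factors = len(A[0])
    return [[j for j, row in enumerate(A) if abs(row[k]) > tol]
            for k in range(num_factors)]


def simulate(A, n, noise_sd=0.1, seed=0):
    """Draw n observations from the assumed model X = A Z + E.

    Z has independent standard normal entries and E is independent
    Gaussian noise; both distributional choices are assumptions.
    """
    rng = random.Random(seed)
    num_factors = len(A[0])
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(num_factors)]
        x = [sum(a * zk for a, zk in zip(row, z)) + rng.gauss(0.0, noise_sd)
             for row in A]
        samples.append(x)
    return samples


pure = pure_variables(A)
clusters = overlapping_clusters(A)
X = simulate(A, n=5)
```

In this toy example the two clusters share variables 4 and 5, which is the kind of overlap the abstract describes. Permuting the columns of A relabels the clusters without changing them, which is the label-switching ambiguity mentioned above.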
