PECOK: a convex optimization approach to variable clustering

The problem of variable clustering is that of grouping similar components of a $p$-dimensional vector $X = (X_1, \ldots, X_p)$, and estimating these groups from $n$ independent copies of $X$. When cluster similarity is defined via $G$-latent models, in which groups of $X$-variables have a common latent generator, and groups are relative to a partition $G$ of the index set $\{1, \ldots, p\}$, the most natural clustering strategy is $K$-means. We explain why this strategy cannot lead to perfect cluster recovery and offer a correction, based on semi-definite programming, that can be viewed as a penalized convex relaxation of $K$-means (PECOK). We introduce a cluster separation measure tailored to $G$-latent models, and derive its minimax lower bound for perfect cluster recovery. The clusters estimated by PECOK are shown to recover $G$ at a near minimax optimal cluster separation rate, a result that holds true even if $K$, the number of clusters, is estimated adaptively from the data. We compare PECOK with appropriate corrections of spectral clustering-type procedures, and show that the former outperforms the latter for perfect cluster recovery of minimally separated clusters.
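To make the setting concrete, the following is a minimal simulation sketch, not the authors' code. It assumes the simplest latent-generator form, in which each variable equals its group's latent factor plus independent noise; the abstract does not spell out the model, so this form, the function name `simulate_g_latent`, and all parameter values are illustrative assumptions. Under this assumption the covariance of the variables is constant on within-group off-diagonal entries, while the diagonal is inflated by variable-specific noise variances.

```python
import numpy as np


def simulate_g_latent(n, groups, latent_cov, noise_var, rng):
    """Draw n copies of X with X_a = Z_{k(a)} + E_a (assumed illustrative model).

    groups[a] is the group index k(a) of variable a; latent_cov is the K x K
    covariance of the latent factors Z; noise_var[a] is the variance of E_a.
    """
    groups = np.asarray(groups)
    n_groups = latent_cov.shape[0]
    # Latent factors, one column per group: shape (n, K).
    Z = rng.multivariate_normal(np.zeros(n_groups), latent_cov, size=n)
    # Independent noise with a different variance per variable: shape (n, p).
    E = rng.normal(0.0, np.sqrt(noise_var), size=(n, groups.size))
    # Each variable copies its group's latent factor and adds its own noise.
    return Z[:, groups] + E


rng = np.random.default_rng(0)
groups = np.repeat(np.arange(3), 4)  # p = 12 variables in K = 3 groups of 4
latent_cov = np.array([[1.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 1.0]])
noise_var = np.linspace(0.5, 2.0, groups.size)  # unequal noise levels

X = simulate_g_latent(5000, groups, latent_cov, noise_var, rng)
sigma_hat = np.cov(X, rowvar=False)  # p x p sample covariance of the variables

# Mask selecting pairs of distinct variables that share a group.
same_group = groups[:, None] == groups[None, :]
within_offdiag = same_group & ~np.eye(groups.size, dtype=bool)

print("mean within-group off-diagonal covariance:", sigma_hat[within_offdiag].mean())
print("diagonal of sample covariance:", np.round(np.diag(sigma_hat), 2))
```

In this sketch the within-group off-diagonal entries concentrate around the latent variance (1.0 here), whereas the diagonal entries vary with the noise level of each variable. The group structure is therefore carried by the off-diagonal block pattern rather than by the diagonal.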