
PECOK: a convex optimization approach to variable clustering

Abstract

The problem of variable clustering is that of grouping similar components of a $p$-dimensional vector $X = (X_{1}, \ldots, X_{p})$ and estimating these groups from $n$ independent copies of $X$. When cluster similarity is defined via $G$-latent models, in which groups of $X$-variables have a common latent generator, and groups are relative to a partition $G$ of the index set $\{1, \ldots, p\}$, the most natural clustering strategy is $K$-means. We explain why this strategy cannot lead to perfect cluster recovery and offer a correction, based on semi-definite programming, that can be viewed as a penalized convex relaxation of $K$-means (PECOK). We introduce a cluster separation measure tailored to $G$-latent models, and derive its minimax lower bound for perfect cluster recovery. The clusters estimated by PECOK are shown to recover $G$ at a near minimax optimal cluster separation rate, a result that holds true even if $K$, the number of clusters, is estimated adaptively from the data. We compare PECOK with appropriate corrections of spectral clustering-type procedures, and show that the former outperforms the latter for perfect cluster recovery of minimally separated clusters.
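
To make the setting concrete, the following is a sketch of one common way to write a $G$-latent model and a corrected convex relaxation of $K$-means. The symbols $Z$, $E$, $k(a)$, $A$, $C$, $\Gamma$, $\widehat{\Sigma}$, $\widehat{\Gamma}$ and $B$ are notation assumed here for illustration; the abstract does not fix them, and the exact constraint set should be checked against the full paper.

```latex
% Assumed notation: k(a) is the index of the group of G containing a,
% Z is the vector of latent generators, E is a noise vector with
% uncorrelated coordinates that is uncorrelated with Z.
\[
  X_a = Z_{k(a)} + E_a, \qquad a = 1, \ldots, p,
\]
% With A the p x K membership matrix (A_{ak} = 1 if a is in group k, else 0),
% C = Cov(Z) and \Gamma = Cov(E) diagonal, the covariance of X is
\[
  \Sigma = \operatorname{Cov}(X) = A C A^{\top} + \Gamma .
\]
% K-means on the variables can be written as a maximization of
% <\widehat{\Sigma}, B> over partnership matrices B of partitions into K groups.
% The diagonal term \Gamma biases this criterion, which suggests subtracting
% an estimate \widehat{\Gamma} and relaxing the constraint set:
\[
  \widehat{B} \in \operatorname*{arg\,max}_{B \in \mathcal{C}}
    \langle \widehat{\Sigma} - \widehat{\Gamma},\, B \rangle,
  \qquad
  \mathcal{C} = \Bigl\{ B \in \mathbb{R}^{p \times p} :
    B \succeq 0,\; B_{ab} \ge 0,\; \textstyle\sum_{a} B_{ab} = 1 \ \forall b,\;
    \operatorname{tr}(B) = K \Bigr\}.
\]
```

In this sketch, the off-diagonal entries of $A C A^{\top}$ carry the group structure, while $\Gamma$ only affects the diagonal of $\Sigma$; this is why a correction of the diagonal is the natural place to repair the $K$-means criterion.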
