
Semidefinite programming on population clustering: a local analysis

Abstract

In this paper, we consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions. In a recent paper (Zhou 2023a), we designed and analyzed two computationally efficient algorithms that partition the data into two groups approximately according to their population of origin, given a small sample. Our work is motivated by the application of clustering individuals according to their population of origin using markers, when the divergence between any two of the populations is small. Moreover, we are interested in the case where individual features are of low average quality $\gamma$, and we want to use as few of them as possible to correctly partition the sample. Here we use $p \gamma$ to denote the $\ell_2^2$ distance between the two population centers (mean vectors), namely, $\mu^{(1)}, \mu^{(2)} \in {\mathbb R}^p$. We allow a full range of tradeoffs between $n, p, \gamma$ in the sense that partial recovery (success rate $< 100\%$) is feasible once the signal-to-noise ratio (SNR) $s^2 := \min\{np \gamma^2, p \gamma\}$ is lower bounded by a constant. Our work builds upon the semidefinite relaxation of an integer quadratic program from Zhou (2023a), which is formulated essentially as finding the maximum cut on a graph, where the edge weights in the cut represent dissimilarity scores between two nodes based on their $p$ features. More importantly, in the present paper we prove that the misclassification error decays exponentially with respect to the SNR $s^2$. The significance of such an exponentially decaying error bound is that when $s^2 = \Omega(\log n)$, perfect recovery of the cluster structure is accomplished. This result was introduced in Zhou (2023a) without a proof. We therefore present the full proof in the present work.
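
As a small numerical illustration of the signal-to-noise ratio defined above, the following sketch evaluates $s^2 = \min\{np\gamma^2, p\gamma\}$ and compares it with $c \log n$. The function names and the constant `c` are hypothetical: the abstract only states $s^2 = \Omega(\log n)$ and does not specify the constant, so this is an assumption-laden sketch rather than the paper's criterion.

```python
import math


def snr(n: int, p: int, gamma: float) -> float:
    """Return s^2 = min(n * p * gamma^2, p * gamma), as defined in the abstract."""
    return min(n * p * gamma ** 2, p * gamma)


def meets_log_threshold(n: int, p: int, gamma: float, c: float = 1.0) -> bool:
    """Check s^2 >= c * log(n).

    c is a hypothetical constant chosen for illustration only; the abstract
    gives the condition s^2 = Omega(log n) without naming the constant.
    """
    return snr(n, p, gamma) >= c * math.log(n)


if __name__ == "__main__":
    # Illustrative values (not taken from the paper).
    n, p, gamma = 200, 1000, 0.01
    print("s^2 =", snr(n, p, gamma))
    print("s^2 >= log n:", meets_log_threshold(n, p, gamma))
```

With $n = 200$, $p = 1000$, $\gamma = 0.01$, the two terms are $np\gamma^2 = 20$ and $p\gamma = 10$, so the minimum is attained by $p\gamma$; this shows that adding samples alone cannot raise $s^2$ beyond the separation $p\gamma$ between the centers.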
