Phase Transitions in the Detection of Correlated Databases

7 February 2023

Abstract

We study the problem of detecting the correlation between two Gaussian databases $\mathsf{X}\in\mathbb{R}^{n\times d}$ and $\mathsf{Y}\in\mathbb{R}^{n\times d}$, each composed of $n$ users with $d$ features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, the two databases are statistically independent. Under the alternative, however, there exists an unknown permutation $\sigma$ over the set of $n$ users (that is, a row permutation) such that $\mathsf{X}$ is $\rho$-correlated with $\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of $n$ and $d$. Specifically, we prove that if $\rho^2d\to0$ as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespective of the value of $n$. This complements the performance of a simple test that thresholds the sum of all entries of $\mathsf{X}^T\mathsf{Y}$. Furthermore, when $d$ is fixed, we prove that strong detection (vanishing error probability) is impossible for any $\rho<\rho^\star$, where $\rho^\star$ is an explicit function of $d$, while weak detection is again impossible as long as $\rho^2d\to0$. These results close significant gaps in recent related studies.
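The simple sum test mentioned in the abstract can be illustrated with a short simulation. The sketch below is not the authors' implementation and rests on several assumptions: the entries are standard normal, the statistic is read as the sum of all $n^2$ pairwise inner products between a user of $\mathsf{X}$ and a user of $\mathsf{Y}$ (which makes it invariant to the unknown permutation), $\rho$ is known to the tester, and the threshold is placed at the midpoint $\rho d/2$ between the null and alternative means. The function names `sample_databases`, `sum_statistic`, and `sum_test` are hypothetical.

```python
import numpy as np


def sample_databases(n, d, rho, correlated, rng):
    """Draw (X, Y) under the null (independent) or the alternative (hypothetical helper).

    Assumption: all entries are standard normal; under the alternative each
    user of X is rho-correlated, feature by feature, with one user of Y.
    """
    X = rng.standard_normal((n, d))
    if not correlated:
        return X, rng.standard_normal((n, d))
    Z = rng.standard_normal((n, d))
    Y_aligned = rho * X + np.sqrt(1.0 - rho**2) * Z
    sigma = rng.permutation(n)  # unknown row permutation hiding the alignment
    return X, Y_aligned[sigma]


def sum_statistic(X, Y):
    """Sum of all pairwise inner products <X_i, Y_j>, normalized by n.

    With users stored as rows this equals (X @ Y.T).sum() / n, computed here
    in O(nd) time as the inner product of the two column-sum vectors.
    """
    n = X.shape[0]
    return float(X.sum(axis=0) @ Y.sum(axis=0)) / n


def sum_test(X, Y, rho):
    """Declare 'correlated' when the statistic exceeds rho * d / 2.

    Assumption: the midpoint threshold is an illustrative choice; the
    statistic has mean 0 under the null and mean rho * d under the alternative.
    """
    d = X.shape[1]
    return sum_statistic(X, Y) > rho * d / 2.0


rng = np.random.default_rng(0)
n, d, rho = 50, 1000, 0.5  # rho^2 * d = 250, well inside the detectable regime
X0, Y0 = sample_databases(n, d, rho, correlated=False, rng=rng)
X1, Y1 = sample_databases(n, d, rho, correlated=True, rng=rng)
print("null declared correlated:", sum_test(X0, Y0, rho))
print("alternative declared correlated:", sum_test(X1, Y1, rho))
```

Under these assumptions the normalized statistic has standard deviation of order $\sqrt{d}$ under both hypotheses while its mean shifts by $\rho d$, so the two cases separate when $\rho^2 d$ is large and overlap when $\rho^2 d\to0$, regardless of $n$.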
