
Phase Transitions in the Detection of Correlated Databases

Abstract

We study the problem of detecting the correlation between two Gaussian databases $\mathsf{X}\in\mathbb{R}^{n\times d}$ and $\mathsf{Y}\in\mathbb{R}^{n\times d}$, each composed of $n$ users with $d$ features. This problem is relevant in applications such as the analysis of social media and computational biology. We formulate it as a hypothesis testing problem: under the null hypothesis, the two databases are statistically independent. Under the alternative, there exists an unknown permutation $\sigma$ over the set of $n$ users (that is, a row permutation) such that $\mathsf{X}$ is $\rho$-correlated with $\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of $n$ and $d$. Specifically, we prove that if $\rho^2 d\to 0$ as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespective of the value of $n$. This complements the performance of a simple test that thresholds the sum of all entries of $\mathsf{X}^T\mathsf{Y}$. Furthermore, when $d$ is fixed, we prove that strong detection (vanishing error probability) is impossible for any $\rho<\rho^\star$, where $\rho^\star$ is an explicit function of $d$, while weak detection is again impossible as long as $\rho^2 d\to 0$. These results close significant gaps in recent related studies.
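The sum-thresholding test mentioned in the abstract can be illustrated with a minimal NumPy sketch. Several choices here are assumptions, not taken from the abstract: rows are users, entries are standard Gaussians, the statistic is read as the sum of all pairwise inner products $\sum_{i,j}\langle \mathsf{X}_i,\mathsf{Y}_j\rangle$ (which is invariant to the unknown permutation), and the threshold is set to the midpoint $\rho n d/2$ between the null mean $0$ and the alternative mean $\rho n d$. The function names are hypothetical.

```python
import numpy as np


def sample_databases(n, d, rho, correlated, rng):
    """Draw (X, Y) under the null (independent) or the alternative (assumed model)."""
    X = rng.standard_normal((n, d))
    Z = rng.standard_normal((n, d))
    if not correlated:
        return X, Z
    # Each entry of Y_aligned has unit variance and correlation rho with X.
    Y_aligned = rho * X + np.sqrt(1.0 - rho**2) * Z
    # Hide the alignment behind an unknown row permutation sigma.
    sigma = rng.permutation(n)
    Y = np.empty_like(Y_aligned)
    Y[sigma] = Y_aligned  # row sigma[i] of Y is correlated with row i of X
    return X, Y


def sum_statistic(X, Y):
    """Sum over all user pairs (i, j) of <X_i, Y_j>, computed via column sums."""
    return float(X.sum(axis=0) @ Y.sum(axis=0))


def sum_test(X, Y, rho):
    """Return True if the test declares 'correlated' (illustrative threshold)."""
    n, d = X.shape
    threshold = 0.5 * rho * n * d  # assumed midpoint between the two means
    return sum_statistic(X, Y) > threshold
```

Under these assumptions the statistic has standard deviation $n\sqrt{d}$ under the null and mean $\rho n d$ under the alternative, so the separation scales like $\rho\sqrt{d}$. This is consistent with the abstract's statement that detection is impossible when $\rho^2 d\to 0$.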
