Phase Transitions in the Detection of Correlated Databases

We study the problem of detecting the correlation between two Gaussian databases $\mathsf{X}$ and $\mathsf{Y}$, each composed of $n$ users with $d$ features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, the two databases are statistically independent. Under the alternative, there exists an unknown permutation $\sigma$ over the set of $n$ users (that is, a row permutation), such that $\mathsf{X}$ is $\rho$-correlated with $\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of $n$ and $d$. Specifically, we prove that if $\rho^2 d \to 0$ as $d \to \infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespective of the value of $n$. This complements the performance of a simple test that thresholds the sum of all entries of $\mathsf{X}^T\mathsf{Y}$. Furthermore, when $d$ is fixed, we prove that strong detection (vanishing error probability) is impossible for any $\rho < \rho^\star$, where $\rho^\star$ is an explicit function of $d$, while weak detection is again impossible as long as $\rho^2 d \to 0$. These results close significant gaps in recent related studies.
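To make the testing setup concrete, the following is a minimal simulation sketch of the two hypotheses and of a sum-based test. It is not the authors' code. It assumes standard normal entries, databases stored as arrays with one row per user (`n` users, `d` features), a correlation parameter `rho`, and a statistic computed as the sum of inner products over all pairs of users, which is one reading of the "sum of all entries" test mentioned above. The function names and the threshold `rho * n * d / 2` are hypothetical choices made for illustration.

```python
import numpy as np


def sample_databases(n, d, rho, correlated, rng):
    """Draw (X, Y) under the null (independent) or the alternative (correlated).

    Assumption: entries are standard normal and each array has shape (n, d),
    one row per user.
    """
    X = rng.standard_normal((n, d))
    if not correlated:
        return X, rng.standard_normal((n, d))
    Z = rng.standard_normal((n, d))
    # Each entry of `aligned` has unit variance and correlation rho with X.
    aligned = rho * X + np.sqrt(1.0 - rho**2) * Z
    sigma = rng.permutation(n)  # unknown row permutation
    Y = np.empty_like(aligned)
    Y[sigma] = aligned  # row sigma[i] of Y is correlated with row i of X
    return X, Y


def sum_statistic(X, Y):
    """Sum of <X_i, Y_j> over all user pairs (i, j).

    This equals the inner product of the column sums, so it does not depend
    on how the rows of Y are ordered.
    """
    return float(X.sum(axis=0) @ Y.sum(axis=0))


def sum_test(X, Y, rho):
    """Declare 'correlated' when the statistic exceeds a hypothetical threshold.

    Under this model the statistic has mean 0 under the null and mean
    rho * n * d under the alternative; the midpoint is used as the threshold.
    """
    n, d = X.shape
    return sum_statistic(X, Y) > 0.5 * rho * n * d


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, rho = 50, 2000, 0.5  # rho**2 * d = 500, far above the threshold regime
    X1, Y1 = sample_databases(n, d, rho, correlated=True, rng=rng)
    X0, Y0 = sample_databases(n, d, rho, correlated=False, rng=rng)
    print("alternative ->", sum_test(X1, Y1, rho))
    print("null        ->", sum_test(X0, Y0, rho))
```

Under these assumptions the statistic has standard deviation of order `n * sqrt(d)` under the null and mean `rho * n * d` under the alternative, so its signal-to-noise ratio scales like `rho * sqrt(d)`. This is consistent with the quantity obtained by multiplying the squared correlation by the number of features being the one that governs detectability.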