Two-sample Testing for Large, Sparse High-Dimensional Multinomials under Rare/Weak Perturbations

Given two samples from possibly different discrete distributions over a common set of size $N$, consider the problem of testing whether these distributions are identical against the following rare/weak perturbation alternative: the frequencies of $N^{1-\beta}$ elements are perturbed by $r \log(N)/(2n)$ in the Hellinger distance, where $n$ is the size of each sample. We adapt the Higher Criticism (HC) test to this setting using P-values obtained from exact binomial tests. We characterize the asymptotic performance of the HC-based test in terms of the sparsity parameter $\beta$ and the perturbation intensity parameter $r$. Specifically, we derive a region in the $(\beta, r)$-plane where the test asymptotically has maximal power, while having asymptotically no power outside this region. Our analysis distinguishes between the cases of dense and sparse contingency tables. In the dense case, the phase transition curve matches that of an analogous two-sample normal means model.
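As a rough illustration of the pipeline described above (per-category exact binomial P-values combined through Higher Criticism), the following minimal Python sketch may help. It is not the authors' implementation. It assumes equal sample sizes, so that under the null each category's count in the first sample, given the combined count, is binomial with success probability 1/2. It uses one common form of the HC statistic (the Donoho-Jin form, normalized by the ordered P-values). The function names and the `gamma` cutoff are hypothetical choices made for this example.

```python
from math import comb, sqrt


def binom_pvalue_half(x, y):
    """Two-sided exact binomial P-value for x successes out of x + y trials.

    Assumption: null success probability is 1/2 (equal sample sizes).
    """
    m = x + y
    if m == 0:
        return 1.0  # no observations in this category: no evidence either way
    d = abs(2 * x - m)
    # Sum the probabilities of all outcomes at least as far from m/2 as x.
    tail = sum(comb(m, k) for k in range(m + 1) if abs(2 * k - m) >= d)
    return tail / 2 ** m


def higher_criticism(pvalues, gamma=0.5):
    """HC statistic over the smallest gamma-fraction of the P-values.

    Returns -inf when no P-value lies strictly between 0 and 1.
    """
    n_tests = len(pvalues)
    ordered = sorted(pvalues)
    limit = max(1, int(gamma * n_tests))
    best = float("-inf")
    for i, p in enumerate(ordered[:limit], start=1):
        if 0.0 < p < 1.0:  # skip degenerate values to avoid division by zero
            z = sqrt(n_tests) * (i / n_tests - p) / sqrt(p * (1.0 - p))
            best = max(best, z)
    return best


def hc_two_sample(counts1, counts2, gamma=0.5):
    """HC statistic for two count vectors over the same categories."""
    pvals = [binom_pvalue_half(x, y) for x, y in zip(counts1, counts2)]
    return higher_criticism(pvals, gamma)
```

A large value of `hc_two_sample(counts1, counts2)` suggests that some categories differ between the two samples. Turning the statistic into a test requires a rejection threshold, for example one calibrated by simulation under the null, which this sketch does not attempt.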