Two-sample Testing for Large, Sparse High-Dimensional Multinomials under Rare/Weak Perturbations

Abstract

Given two samples from possibly different discrete distributions over a common set of size $N$, consider the problem of testing whether these distributions are identical, vs. the following rare/weak perturbation alternative: the frequencies of $N^{1-\beta}$ elements are perturbed by $r(\log N)/2n$ in the Hellinger distance, where $n$ is the size of each sample. We adapt the Higher Criticism (HC) test to this setting using P-values obtained from $N$ exact binomial tests. We characterize the asymptotic performance of the HC-based test in terms of the sparsity parameter $\beta$ and the perturbation intensity parameter $r$. Specifically, we derive a region in the $(\beta,r)$-plane where the test asymptotically has maximal power, while having asymptotically no power outside this region. Our analysis distinguishes between the cases of dense ($N\gg n$) and sparse ($N\ll n$) contingency tables. In the dense case, the phase transition curve matches that of an analogous two-sample normal means model.
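
As a rough illustration of the kind of procedure the abstract describes, the following Python sketch computes one exact two-sided binomial P-value per category and then combines the P-values with a Higher Criticism statistic. Several details are assumptions rather than the paper's definitions: the function names, the conditioning on the per-category total with success probability 1/2 (which presumes equal sample sizes), the normalization by $\sqrt{(i/N)(1-i/N)}$, and the search fraction `gamma0` are all illustrative choices.

```python
import math


def binom_two_sided_pvalue(x, m):
    """Exact two-sided P-value for observing x under Binomial(m, 1/2).

    Sums the probabilities of all outcomes at least as far from m/2 as x.
    Returns 1.0 when m == 0 (no observations in this category).
    """
    if m == 0:
        return 1.0
    dev = abs(x - m / 2)
    tail = sum(math.comb(m, k) for k in range(m + 1) if abs(k - m / 2) >= dev)
    return tail / 2 ** m


def higher_criticism(pvalues, gamma0=0.5):
    """Higher Criticism statistic over the smallest gamma0 fraction of P-values.

    Assumed variant: the i-th smallest P-value is compared with i/N and the
    difference is standardized by sqrt((i/N) * (1 - i/N)).
    """
    N = len(pvalues)
    if N < 2:
        raise ValueError("need at least two P-values")
    sorted_p = sorted(pvalues)
    # Stop before i == N so the standardizing factor is never zero.
    limit = min(max(1, int(gamma0 * N)), N - 1)
    best = -math.inf
    for i in range(1, limit + 1):
        frac = i / N
        z = math.sqrt(N) * (frac - sorted_p[i - 1]) / math.sqrt(frac * (1 - frac))
        best = max(best, z)
    return best


def hc_two_sample(counts_x, counts_y, gamma0=0.5):
    """HC statistic for two count vectors over the same N categories."""
    pvals = [binom_two_sided_pvalue(x, x + y) for x, y in zip(counts_x, counts_y)]
    return higher_criticism(pvals, gamma0)


# Hypothetical toy data: 20 categories, two of which differ strongly.
counts_x = [50] * 20
counts_y = [50] * 18 + [5, 0]
hc_value = hc_two_sample(counts_x, counts_y)
```

A large value of `hc_value` suggests that a small subset of categories departs from the null hypothesis of identical distributions. Choosing a rejection threshold (for example by simulating the statistic under the null) is left out of this sketch.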
