Two-sample Testing for Large, Sparse High-Dimensional Multinomials under Rare/Weak Perturbations

3 July 2020

Abstract

Given two samples from possibly different discrete distributions over a common set of size $N$, consider the problem of testing whether these distributions are identical, vs. the following rare/weak perturbation alternative: the frequencies of $N^{1-\beta}$ elements are perturbed by $r(\log N)/2n$ in the Hellinger distance, where $n$ is the size of each sample. We adapt the Higher Criticism (HC) test to this setting using P-values obtained from $N$ exact binomial tests. We characterize the asymptotic performance of the HC-based test in terms of the sparsity parameter $\beta$ and the perturbation intensity parameter $r$. Specifically, we derive a region in the $(\beta,r)$-plane where the test asymptotically has maximal power, while having asymptotically no power outside this region. Our analysis distinguishes between the cases of dense ($N\gg n$) and sparse ($N\ll n$) contingency tables. In the dense case, the phase transition curve matches that of an analogous two-sample normal means model.
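The pipeline described in the abstract (one exact binomial test per category, then Higher Criticism over the resulting P-values) can be illustrated with a minimal sketch. Several details here are assumptions rather than the paper's stated construction: the function names are hypothetical, the binomial null uses success probability $1/2$ (natural when both samples have the same size $n$), categories with no observations in either sample are dropped, the HC statistic uses the common normalization by $\sqrt{(i/N)(1-i/N)}$, and the search fraction `gamma0` is an arbitrary choice. The direct summation is only suitable for small counts.

```python
import math

def binom_two_sided_pvalue(x, m, p=0.5):
    """Exact two-sided P-value P(|Bin(m, p) - m*p| >= |x - m*p|).

    Assumption: p = 1/2, i.e. both samples have the same size.
    Direct summation; intended for small m only.
    """
    if m == 0:
        return 1.0
    dev = abs(x - m * p)
    total = 0.0
    for k in range(m + 1):
        # Small tolerance so outcomes tied with the observed deviation count.
        if abs(k - m * p) >= dev - 1e-12:
            total += math.comb(m, k) * p**k * (1 - p) ** (m - k)
    return min(1.0, total)

def higher_criticism(pvalues, gamma0=0.25):
    """HC statistic over the smallest gamma0-fraction of sorted P-values.

    Assumption: normalization by sqrt((i/N) * (1 - i/N)), one common variant.
    """
    N = len(pvalues)
    if N < 2:
        raise ValueError("need at least two P-values")
    sorted_p = sorted(pvalues)
    # Stop before i = N so the denominator never vanishes.
    limit = min(max(1, int(gamma0 * N)), N - 1)
    best = -math.inf
    for i in range(1, limit + 1):
        frac = i / N
        z = math.sqrt(N) * (frac - sorted_p[i - 1]) / math.sqrt(frac * (1 - frac))
        best = max(best, z)
    return best

def two_sample_hc(counts1, counts2, gamma0=0.25):
    """HC statistic for two count vectors over the same categories."""
    pvalues = []
    for x, y in zip(counts1, counts2):
        m = x + y
        if m == 0:
            continue  # assumption: unobserved categories carry no evidence
        pvalues.append(binom_two_sided_pvalue(x, m))
    return higher_criticism(pvalues, gamma0)
```

Large values of the statistic returned by `two_sample_hc` suggest that a small subset of categories differs between the two samples; the rejection threshold is not specified here and would need to be calibrated, for example by simulation under the null.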
