The minimax risk in testing the histogram of discrete distributions for uniformity under missing ball alternatives

Abstract

We consider the problem of testing the fit of a discrete sample of items from many categories to the uniform distribution over the categories. As a class of alternative hypotheses, we consider the removal of an $\ell_p$ ball of radius $\epsilon$ around the uniform rate sequence for $p \leq 2$. We deliver a sharp characterization of the asymptotic minimax risk when $\epsilon \to 0$ as the number of samples and number of dimensions go to infinity, for testing based on the occurrences' histogram (number of absent categories, singletons, collisions, ...). For example, for $p=1$ and in the limit of a small expected number of samples $n$ compared to the number of categories $N$ (the "sub-linear" regime), the minimax risk $R^*_\epsilon$ asymptotes to $2 \bar{\Phi}\left(n \epsilon^2/\sqrt{8N}\right)$, with $\bar{\Phi}(x)$ the normal survival function. Empirical studies over a range of problem parameters show that this estimate is accurate in finite samples, and that our test is significantly better than the chi-squared test or a test that only uses collisions. Our analysis is based on the asymptotic normality of histogram ordinates, the equivalence between the minimax setting and a Bayesian one, and the reduction of a multi-dimensional optimization problem to a one-dimensional problem.
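
To make the two objects named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It evaluates the stated $p=1$ sub-linear limit $2 \bar{\Phi}\left(n \epsilon^2/\sqrt{8N}\right)$ and builds an occurrences' histogram from a sample. The function names (`normal_survival`, `asymptotic_risk_l1`, `occurrence_histogram`) and the encoding of categories as integers are assumptions made here for illustration only.

```python
import math
from collections import Counter


def normal_survival(x: float) -> float:
    """Standard normal survival function, P(Z > x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def asymptotic_risk_l1(n: float, eps: float, num_categories: int) -> float:
    """Evaluate 2 * survival(n * eps^2 / sqrt(8 N)), the p = 1 sub-linear limit from the abstract."""
    return 2.0 * normal_survival(n * eps**2 / math.sqrt(8.0 * num_categories))


def occurrence_histogram(sample, num_categories: int) -> dict:
    """Map m -> number of categories that appear exactly m times in the sample.

    Assumes (for illustration) that each item in `sample` is a category label
    and that there are `num_categories` categories in total.
    """
    counts = Counter(sample)              # occurrences per observed category
    hist = Counter(counts.values())       # how many categories share each count
    hist[0] = num_categories - len(counts)  # absent categories
    return dict(hist)
```

For instance, `occurrence_histogram([0, 0, 1, 3], 5)` reports two absent categories, two singletons, and one category seen twice. At $\epsilon = 0$ the risk expression equals 1, and it decreases as $n \epsilon^2/\sqrt{8N}$ grows.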
