The Minimax Risk in Histogram-Based Uniformity Testing under Missing Ball Alternatives

Abstract

We study the problem of testing the goodness of fit of a discrete sample from many categories to the uniform distribution over the categories. As a class of alternative hypotheses, we consider the removal of an $\ell_p$ ball of radius $\epsilon$ around the uniform rate sequence for $p \leq 2$. When the number of samples $n$ and number of categories $N$ go to infinity while $\epsilon$ is small, the minimax risk $R_\epsilon^*$ in testing based on the sample's histogram (number of absent categories, singletons, collisions, ...) asymptotes to $2\Phi(-n N^{2-2/p} \epsilon^2/\sqrt{8N})$, where $\Phi(x)$ is the normal CDF. This result allows the comparison of the many estimators previously proposed for this problem at the constant level, rather than at the rate of convergence of the risk or the scaling order of the sample complexity. The minimax test mostly relies on collisions in the very small sample limit but otherwise behaves like the chi-squared test. Empirical studies over a range of problem parameters show that our estimate is accurate in finite samples and that the minimax test is significantly better than the chi-squared test or a test that only uses collisions. Our analysis relies on the asymptotic normality of histogram ordinates, the equivalence between the minimax setting and a Bayesian setting, and the characterization of the least favorable prior by reducing a multi-dimensional optimization problem to a one-dimensional problem.
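
As a rough illustration of the quantities named in the abstract, the sketch below evaluates the asymptotic expression $2\Phi(-n N^{2-2/p} \epsilon^2/\sqrt{8N})$ and tabulates the histogram of a sample (how many categories occur exactly $m$ times). The function names `asymptotic_minimax_risk` and `histogram_ordinates` are hypothetical and chosen here for illustration; this is not the authors' code, and the formula is only an asymptotic approximation in the regime the abstract describes.

```python
import math
from collections import Counter


def asymptotic_minimax_risk(n, N, eps, p=2.0):
    """Evaluate 2 * Phi(-n * N**(2 - 2/p) * eps**2 / sqrt(8 * N)).

    Hypothetical helper: n is the sample size, N the number of categories,
    eps the radius of the removed l_p ball, and p <= 2.
    """
    if not 0 < p <= 2:
        raise ValueError("the abstract states the result for p <= 2")
    t = n * N ** (2.0 - 2.0 / p) * eps ** 2 / math.sqrt(8.0 * N)
    # Standard normal CDF: Phi(x) = 0.5 * erfc(-x / sqrt(2)),
    # hence 2 * Phi(-t) = erfc(t / sqrt(2)).
    return math.erfc(t / math.sqrt(2.0))


def histogram_ordinates(sample, N):
    """Return {m: number of categories that occur exactly m times}.

    Assumes categories are labeled 0..N-1; m = 0 counts absent categories
    and m = 1 counts singletons.
    """
    per_category = Counter(sample)           # occurrences of each category
    ordinates = Counter(per_category.values())
    ordinates[0] = N - len(per_category)     # categories never observed
    return dict(ordinates)
```

For example, `asymptotic_minimax_risk(1000, 800, 0.01, p=2.0)` has normal argument $-1$ and so evaluates to $2\Phi(-1) \approx 0.317$, while `eps = 0` gives a risk of 1, the value at which the two hypotheses are indistinguishable.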
