Testing Closeness of Multivariate Distributions via Ramsey Theory

We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions $p, q$ on $\mathbb{R}^d$, we want to distinguish between the case that $p = q$ versus $\|p - q\|_{\mathcal{A}_k} > \epsilon$, where $\|p - q\|_{\mathcal{A}_k}$ denotes the generalized $\mathcal{A}_k$ distance between $p$ and $q$ -- measuring the maximum discrepancy between the distributions over any collection of $k$ disjoint, axis-aligned rectangles. Our main result is the first closeness tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, we provide a computationally efficient closeness tester with sample complexity $O\big((k^{6/7}/\mathrm{poly}_d(\epsilon))\,\log^d(k)\big)$. On the lower bound side, we establish a qualitatively matching sample complexity lower bound of $\Omega(k^{6/7}/\mathrm{poly}(\epsilon))$, even for $d = 2$. These sample complexity bounds are surprising because the sample complexity of the problem in the univariate setting is $\Theta(k^{4/5}/\mathrm{poly}(\epsilon))$. This has the interesting consequence that the jump from one to two dimensions leads to a substantial increase in sample complexity, while increases beyond that do not. As a corollary of our general $\mathcal{A}_k$ tester, we obtain $d_{\mathrm{TV}}$-closeness testers for pairs of $k$-histograms on $\mathbb{R}^d$ over a common unknown partition, and pairs of uniform distributions supported on the union of $k$ unknown disjoint axis-aligned rectangles. Both our algorithm and our lower bound make essential use of tools from Ramsey theory.
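For context, the generalized $\mathcal{A}_k$ distance referenced above is commonly formalized along the following lines (a sketch; the paper's exact definition and normalization may differ):

\[
\|p - q\|_{\mathcal{A}_k} \;=\; \sup_{R_1, \dots, R_k} \;\sum_{i=1}^{k} \bigl| p(R_i) - q(R_i) \bigr|,
\]

where the supremum ranges over all collections of $k$ pairwise disjoint, axis-aligned rectangles $R_1, \dots, R_k \subseteq \mathbb{R}^d$. Under a definition of this form, total variation closeness of $k$-histograms over a common partition reduces to closeness in $\mathcal{A}_k$ distance, which is how corollaries of the kind stated in the abstract typically arise.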
View on arXiv