Testing with Non-identically Distributed Samples

We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: suppose there is a set of distributions over a discrete support of size $k$, $\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_T$, and we obtain $c$ independent draws from each distribution. The goal is to learn or test a property of the average distribution, $\mathbf{p}_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities, such as individuals, chronologically distinct time periods, or spatially separated data sources. From a learning standpoint, even with $c = 1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\mathbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance. For testing uniformity or identity, that is, distinguishing the case that $\mathbf{p}_{\mathrm{avg}}$ equals some reference distribution from the case that it has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a number of samples linear in $k$ is necessary given $c = 1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear-sample testing of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ samples are sufficient, matching the optimal sample complexity of the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c = 2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers only the multiset of samples (ignoring which samples were drawn from the same $\mathbf{p}_i$) can perform uniformity testing.
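To make the sampling model concrete, the following is a minimal simulation sketch, not the paper's tester or learner. It draws a fixed number of samples from each of several heterogeneous distributions, pools them, and compares the pooled empirical distribution to the average distribution in TV distance. All function names, the choice of random distributions, and the parameter values are illustrative assumptions.

```python
import random
from collections import Counter


def random_distribution(k, rng):
    """Return a random probability vector of length k (illustrative choice)."""
    weights = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_heterogeneous(dists, c, rng):
    """Draw c independent samples from each distribution in dists.

    Returns one tuple per distribution, so the grouping of samples
    by source distribution is preserved.
    """
    support = range(len(dists[0]))
    return [tuple(rng.choices(support, weights=p, k=c)) for p in dists]


def average_distribution(dists):
    """Coordinate-wise average of the given probability vectors."""
    num_dists = len(dists)
    k = len(dists[0])
    return [sum(p[j] for p in dists) / num_dists for j in range(k)]


def pooled_empirical(groups, k):
    """Empirical distribution of all samples, ignoring their grouping."""
    counts = Counter(x for group in groups for x in group)
    n = sum(counts.values())
    return [counts[j] / n for j in range(k)]


def tv_distance(p, q):
    """Total variation distance: half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    rng = random.Random(0)
    k, num_dists, c = 10, 20000, 1  # hypothetical parameter values
    dists = [random_distribution(k, rng) for _ in range(num_dists)]
    groups = sample_heterogeneous(dists, c, rng)
    p_avg = average_distribution(dists)
    p_hat = pooled_empirical(groups, k)
    print("TV distance between pooled estimate and average:",
          tv_distance(p_hat, p_avg))
```

The pooled empirical distribution is an unbiased estimate of the average distribution even with one sample per source, which is the intuition behind the learning claim. The sketch keeps samples grouped by source because, for testing, which samples share a source carries information that the pooled multiset discards.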