
Sequential algorithms for testing identity and closeness of distributions

Abstract

What advantage do \emph{sequential} procedures provide over batch algorithms for testing properties of unknown distributions? Focusing on the problem of testing whether two distributions $\mathcal{D}_1$ and $\mathcal{D}_2$ on $\{1,\dots, n\}$ are equal or $\epsilon$-far, we give several answers to this question. We show that for a small alphabet size $n$, there is a sequential algorithm that outperforms any batch algorithm by a factor of at least $4$ in terms of sample complexity. For a general alphabet size $n$, we give a sequential algorithm that uses no more samples than its batch counterpart, and possibly fewer if the actual distance $TV(\mathcal{D}_1, \mathcal{D}_2)$ between $\mathcal{D}_1$ and $\mathcal{D}_2$ is larger than $\epsilon$. As a corollary, letting $\epsilon$ go to $0$, we obtain a sequential algorithm for testing closeness, when no a priori bound on $TV(\mathcal{D}_1, \mathcal{D}_2)$ is given, with sample complexity $\tilde{\mathcal{O}}(\frac{n^{2/3}}{TV(\mathcal{D}_1, \mathcal{D}_2)^{4/3}})$: this improves over the $\tilde{\mathcal{O}}(\frac{n/\log n}{TV(\mathcal{D}_1, \mathcal{D}_2)^{2}})$ tester of \cite{daskalakis2017optimal} and is optimal up to multiplicative constants. We also establish limitations of sequential algorithms for the problem of testing identity and closeness: they can improve the worst-case number of samples by at most a constant factor.
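To make the quantity $TV(\mathcal{D}_1, \mathcal{D}_2)$ concrete, here is a minimal Python sketch. It assumes the standard definition of total variation distance on a finite alphabet (half the $\ell_1$ distance between the probability vectors) and assumes that "$\epsilon$-far" means $TV \geq \epsilon$; the abstract states neither convention explicitly. The function names `tv_distance` and `is_epsilon_far` are hypothetical. This only illustrates the property being tested: it is not the sequential tester, which sees samples rather than the distributions themselves.

```python
from typing import Sequence


def tv_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions on {1, ..., n}.

    Assumes the standard definition TV(p, q) = (1/2) * sum_i |p_i - q_i|,
    with p and q given as probability vectors of equal length n.
    """
    if len(p) != len(q):
        raise ValueError("both distributions must share the same alphabet size n")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def is_epsilon_far(p: Sequence[float], q: Sequence[float], eps: float) -> bool:
    """True if the two distributions are eps-far (assumed to mean TV >= eps)."""
    return tv_distance(p, q) >= eps


if __name__ == "__main__":
    d1 = [0.5, 0.5]
    d2 = [0.25, 0.75]
    print(tv_distance(d1, d2))
    print(is_epsilon_far(d1, d2, 0.1))
```

A tester must distinguish `tv_distance(d1, d2) == 0` from `is_epsilon_far(d1, d2, eps)` using only samples. The abstract's point is that a sequential procedure can stop early when the true distance is well above `eps`.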
