We present a framework for nonparametrically testing independence between two random vectors that is scalable to massive data both computationally and statistically. We adopt a multi-scale divide-and-conquer strategy, breaking down the multivariate test into univariate tests of independence on a collection of 2×2 contingency tables constructed by sequentially discretizing the sample space from coarse to fine scales. This strategy transforms a nonparametric testing problem that traditionally requires computational complexity quadratic in the sample size into one that scales almost linearly with the sample size. We further consider the scenario in which the dimensionality of the random vectors grows large, in which case the curse of dimensionality manifests itself in our framework as an explosion in the number of univariate tests to be completed. To address this challenge, we propose a data-adaptive coarse-to-fine testing procedure that completes only a fraction of the univariate tests, namely those judged most likely to contain evidence of dependency, by exploiting the spatial features of dependency structures. We provide a finite-sample theoretical guarantee for the exact validity of the adaptive procedure. In particular, we show that this procedure achieves strong control of the family-wise error rate without any need for resampling or large-sample approximation, which existing approaches typically require. Through an extensive simulation study, we demonstrate the substantial computational advantage of the procedure over existing approaches, as well as its desirable statistical power under various dependency scenarios. We illustrate through examples that our framework can be used not only for testing independence but also for learning the nature of the underlying dependency. Finally, we demonstrate the use of our method by analyzing a data set from flow cytometry.
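To make the coarse-to-fine discretization concrete, below is a minimal illustrative sketch (not the authors' implementation) for a single pair of margins: the current rectangle is split at its midpoints, the four quadrant counts form a 2×2 contingency table tested with Fisher's exact test, and the procedure recurses into the four children up to a chosen maximum resolution. Function names, the `max_depth` parameter, and the minimum-count stopping rule are illustrative assumptions; the data-adaptive selection of tests and the family-wise error rate control described above are omitted.

```python
# Minimal sketch of coarse-to-fine 2x2 testing on one pair of margins.
# All names and thresholds are illustrative, not the paper's implementation.
import numpy as np
from scipy.stats import fisher_exact

def quadrant_table(x, y, x_cut, y_cut):
    """2x2 table of counts for the four quadrants defined by (x_cut, y_cut)."""
    left, low = x <= x_cut, y <= y_cut
    return np.array([[np.sum(left & low),  np.sum(left & ~low)],
                     [np.sum(~left & low), np.sum(~left & ~low)]])

def coarse_to_fine_tests(x, y, x_lo, x_hi, y_lo, y_hi, depth, max_depth, out):
    """Collect (depth, rectangle, p-value) for each 2x2 table down to max_depth."""
    inside = (x > x_lo) & (x <= x_hi) & (y > y_lo) & (y <= y_hi)
    xs, ys = x[inside], y[inside]
    if depth > max_depth or xs.size < 10:      # stop at fine scales / sparse cells
        return
    x_cut, y_cut = (x_lo + x_hi) / 2, (y_lo + y_hi) / 2
    table = quadrant_table(xs, ys, x_cut, y_cut)
    _, p = fisher_exact(table)                 # univariate 2x2 test of independence
    out.append((depth, (x_lo, x_hi, y_lo, y_hi), p))
    for new_x in ((x_lo, x_cut), (x_cut, x_hi)):   # recurse into the four children
        for new_y in ((y_lo, y_cut), (y_cut, y_hi)):
            coarse_to_fine_tests(x, y, *new_x, *new_y, depth + 1, max_depth, out)

# Example usage on rank-transformed margins (so cuts are at dyadic fractions):
rng = np.random.default_rng(0)
x = rng.uniform(size=500)
y = x**2 + 0.1 * rng.normal(size=500)
u = (np.argsort(np.argsort(x)) + 1) / 501.0
v = (np.argsort(np.argsort(y)) + 1) / 501.0
results = []
coarse_to_fine_tests(u, v, 0.0, 1.0, 0.0, 1.0, depth=0, max_depth=3, out=results)
```

In this exhaustive form the number of tests grows with both the resolution and the dimensionality of the two vectors, which is exactly the explosion that motivates the data-adaptive procedure described in the abstract.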