Kernel Two-Sample Tests for Manifold Data

We present a study of a kernel-based two-sample test statistic related to the Maximum Mean Discrepancy (MMD) in the manifold data setting, assuming that high-dimensional observations are close to a low-dimensional manifold. We characterize the test level and power in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, we show that when data densities are supported on a $d$-dimensional sub-manifold $\mathcal{M}$ embedded in an $m$-dimensional space, the kernel two-sample test for data sampled from a pair of distributions $p$ and $q$ that are Hölder with order $\beta$ (up to 2) is powerful when the number of samples $n$ is large enough that $\Delta_2 \gtrsim n^{-2\beta/(d+4\beta)}$, where $\Delta_2$ is the squared $L^2$-divergence between $p$ and $q$ on the manifold. We establish a lower bound on the test power for finite, sufficiently large $n$, where the kernel bandwidth parameter $\gamma$ scales as $n^{-1/(d+4\beta)}$. The analysis extends to cases where the manifold has a boundary and where the data samples contain high-dimensional additive noise. Our results indicate that the kernel two-sample test does not suffer from the curse of dimensionality when the data lie on or near a low-dimensional manifold. We validate our theory and the properties of the kernel test for manifold data through a series of numerical experiments.
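To make the setting concrete, the following is a minimal sketch of a generic Gaussian-kernel MMD two-sample test applied to points near a circle (a one-dimensional manifold) embedded in a higher-dimensional space. It is not the paper's exact statistic or calibration: it uses the standard biased (V-statistic) estimate of the squared MMD and a permutation threshold, both chosen here for illustration. The names `gaussian_kernel`, `mmd2_biased`, `mmd_permutation_test`, and `sample_circle` are hypothetical helpers, and the bandwidth `gamma` is left as a free parameter.

```python
import numpy as np


def gaussian_kernel(A, B, gamma):
    """Gaussian kernel matrix with entries exp(-||a - b||^2 / (2 * gamma^2))."""
    sq = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * gamma**2))


def mmd2_biased(K, n_x):
    """Biased (V-statistic) squared MMD from the pooled kernel matrix K.

    The first n_x rows/columns of K belong to sample X, the rest to sample Y.
    """
    Kxx = K[:n_x, :n_x]
    Kyy = K[n_x:, n_x:]
    Kxy = K[:n_x, n_x:]
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()


def mmd_permutation_test(X, Y, gamma, n_perm=200, alpha=0.05, seed=0):
    """Return (statistic, p_value, reject) for H0: X and Y share a distribution.

    Assumption: the null distribution is approximated by randomly permuting
    the pooled sample, which is a generic calibration and not necessarily the
    thresholding analyzed in the paper.
    """
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n_x = X.shape[0]
    K = gaussian_kernel(Z, Z, gamma)  # computed once, reused for every permutation
    stat = mmd2_biased(K, n_x)
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(Z.shape[0])
        null[b] = mmd2_biased(K[np.ix_(idx, idx)], n_x)
    p_value = (1.0 + np.sum(null >= stat)) / (1.0 + n_perm)
    return float(stat), float(p_value), bool(p_value <= alpha)


def sample_circle(n, m, rng, kappa=None, noise=0.0):
    """Sample n points near a unit circle embedded in R^m (m >= 2).

    Angles are uniform when kappa is None, otherwise von Mises with
    concentration kappa. Optional Gaussian noise is added in all m coordinates.
    """
    if kappa is None:
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    else:
        theta = rng.vonmises(0.0, kappa, size=n)
    X = np.zeros((n, m))
    X[:, 0] = np.cos(theta)
    X[:, 1] = np.sin(theta)
    return X + noise * rng.standard_normal((n, m))
```

A possible usage, with all numeric values chosen arbitrarily for illustration: draw `X = sample_circle(200, 20, rng)` and `Y = sample_circle(200, 20, rng, kappa=1.0, noise=0.05)` with `rng = np.random.default_rng(1)`, then call `mmd_permutation_test(X, Y, gamma=0.5)`. The intrinsic dimension here is 1 regardless of the ambient dimension 20, which is the regime the abstract describes.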