Kernel Two-Sample Tests for Manifold Data

Abstract

We present a study of a kernel-based two-sample test statistic related to the Maximum Mean Discrepancy (MMD) in the manifold data setting, assuming that high-dimensional observations are close to a low-dimensional manifold. We characterize the test level and power in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, we show that when data densities are supported on a $d$-dimensional sub-manifold $\mathcal{M}$ embedded in an $m$-dimensional space, the kernel two-sample test for data sampled from a pair of distributions $p$ and $q$ that are Hölder with order $\beta$ (up to 2) is powerful when the number of samples $n$ is large enough that $\Delta_2 \gtrsim n^{-2\beta/(d+4\beta)}$, where $\Delta_2$ is the squared $L^2$-divergence between $p$ and $q$ on the manifold. We establish a lower bound on the test power for finite $n$ that is sufficiently large, where the kernel bandwidth parameter $\gamma$ scales as $n^{-1/(d+4\beta)}$. The analysis extends to cases where the manifold has a boundary and where the data samples contain high-dimensional additive noise. Our results indicate that the kernel two-sample test does not suffer from a curse of dimensionality when the data lie on or near a low-dimensional manifold. We validate our theory and the properties of the kernel test for manifold data through a series of numerical experiments.
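To make the bandwidth scaling concrete, the following is a minimal sketch of a kernel two-sample test on data near a one-dimensional manifold (a unit circle embedded in $\mathbb{R}^{20}$). Several choices here are assumptions for illustration and are not taken from the abstract: the Gaussian kernel, the biased (V-statistic) MMD estimate, the permutation-based p-value, the constant `c = 1` in the bandwidth rule, and all function names. Only the rate $\gamma \propto n^{-1/(d+4\beta)}$ comes from the text above; the paper's exact statistic and threshold may differ.

```python
import numpy as np

def bandwidth(n, d, beta=2.0, c=1.0):
    # Rate from the abstract: gamma ~ n^{-1/(d + 4*beta)}.
    # The constant c is a hypothetical tuning parameter.
    return c * n ** (-1.0 / (d + 4.0 * beta))

def gaussian_gram(Z, gamma):
    # Pairwise squared Euclidean distances in the ambient space.
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    np.maximum(d2, 0.0, out=d2)  # clip tiny negatives from round-off
    return np.exp(-d2 / (2.0 * gamma ** 2))

def mmd2_biased(K, idx_x, idx_y):
    # Biased MMD^2 estimate: mean(Kxx) + mean(Kyy) - 2 * mean(Kxy).
    kxx = K[np.ix_(idx_x, idx_x)].mean()
    kyy = K[np.ix_(idx_y, idx_y)].mean()
    kxy = K[np.ix_(idx_x, idx_y)].mean()
    return kxx + kyy - 2.0 * kxy

def permutation_test(X, Y, gamma, n_perm=200, seed=0):
    # Calibrate the statistic by permuting the pooled sample
    # (an assumed calibration, not necessarily the paper's threshold).
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n = len(X)
    K = gaussian_gram(Z, gamma)
    idx = np.arange(len(Z))
    observed = mmd2_biased(K, idx[:n], idx[n:])
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        if mmd2_biased(K, perm[:n], perm[n:]) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value

def embed_circle(theta, m=20, noise=0.0, rng=None):
    # Map angles to a unit circle (intrinsic d = 1) inside R^m,
    # optionally adding isotropic Gaussian noise in all m coordinates.
    X = np.zeros((len(theta), m))
    X[:, 0] = np.cos(theta)
    X[:, 1] = np.sin(theta)
    if noise > 0.0:
        X = X + noise * rng.standard_normal(X.shape)
    return X

rng = np.random.default_rng(1)
n, d = 200, 1
gamma = bandwidth(n, d)  # uses the intrinsic d = 1, not the ambient m = 20

# p: uniform angles; q: von Mises angles concentrated near 0.
X = embed_circle(rng.uniform(-np.pi, np.pi, size=n))
Y = embed_circle(rng.vonmises(0.0, 2.0, size=n))
stat_alt, p_alt = permutation_test(X, Y, gamma)

# Null case: both samples drawn from p.
X2 = embed_circle(rng.uniform(-np.pi, np.pi, size=n))
stat_null, p_null = permutation_test(X, X2, gamma)

print(f"gamma = {gamma:.3f}")
print(f"alternative: MMD^2 = {stat_alt:.4f}, p-value = {p_alt:.3f}")
print(f"null:        MMD^2 = {stat_null:.4f}, p-value = {p_null:.3f}")
```

Note that the bandwidth is computed from the intrinsic dimension $d$ rather than the ambient dimension $m$, which is the practical reading of the claim that the test avoids a curse of dimensionality. Setting `noise` to a small positive value in `embed_circle` gives a rough way to probe the additive-noise setting mentioned above.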
