Optimal Coreset for Gaussian Kernel Density Estimation

Abstract

Given a point set \(P\subset \mathbb{R}^d\), the kernel density estimate of \(P\) is defined as \[ \overline{\mathcal{G}}_P(x) = \frac{1}{\left|P\right|}\sum_{p\in P}e^{-\left\lVert x-p \right\rVert^2} \] for any \(x\in\mathbb{R}^d\). We study how to construct a small subset \(Q\) of \(P\) such that the kernel density estimate of \(P\) is approximated by the kernel density estimate of \(Q\). This subset \(Q\) is called a coreset. The main technique in this work is to construct a \(\pm 1\) coloring on the point set \(P\) using discrepancy theory, leveraging Banaszczyk's Theorem. When \(d>1\) is a constant, our construction gives a coreset of size \(O\left(\frac{1}{\varepsilon}\right)\), as opposed to the best-known result of \(O\left(\frac{1}{\varepsilon}\sqrt{\log\frac{1}{\varepsilon}}\right)\). This is the first result to break the barrier of the \(\sqrt{\log}\) factor, even when \(d=2\).
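To make the definition concrete, the following minimal sketch evaluates the kernel density estimate above with NumPy and measures how far the estimate of a subset is from that of the full point set. The function names `gaussian_kde` and `max_kde_error`, the sample sizes, and the finite set of query points are all assumptions made for illustration. The subset here is a uniform random sample, used only as a baseline; it is not the discrepancy-based coloring construction described in the abstract, and the maximum over a finite set of queries only approximates the worst-case error over all of \(\mathbb{R}^d\).

```python
import numpy as np


def gaussian_kde(points, queries):
    """Evaluate (1/|P|) * sum_p exp(-||x - p||^2) at each query point x."""
    # Pairwise differences, shape (num_queries, num_points, d).
    diff = queries[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Average the kernel values over the point set.
    return np.exp(-sq_dist).mean(axis=1)


def max_kde_error(P, Q, queries):
    """Largest absolute gap between the estimates of P and Q over the queries."""
    gap = gaussian_kde(P, queries) - gaussian_kde(Q, queries)
    return float(np.max(np.abs(gap)))


# Hypothetical data: 2000 points in the plane (d = 2).
rng = np.random.default_rng(0)
P = rng.normal(size=(2000, 2))
queries = rng.uniform(-3.0, 3.0, size=(500, 2))

# Baseline subset: a uniform random sample of P (NOT the paper's construction).
idx = rng.choice(len(P), size=200, replace=False)
Q = P[idx]

err = max_kde_error(P, Q, queries)
print(f"max error of random subset on sampled queries: {err:.4f}")
```

A coreset construction would replace the random sampling step with a more careful choice of `Q`, aiming for a smaller error at the same subset size.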
