
Coresets for Clustering in Euclidean Spaces: Importance Sampling is Nearly Optimal

Symposium on the Theory of Computing (STOC), 2020
Abstract

Given a collection of $n$ points in $\mathbb{R}^d$, the goal of the $(k,z)$-clustering problem is to find a subset of $k$ "centers" that minimizes the sum of the $z$-th powers of the Euclidean distance of each point to the closest center. Special cases of the $(k,z)$-clustering problem include the $k$-median and $k$-means problems. Our main result is a unified two-stage importance sampling framework that constructs an $\varepsilon$-coreset for the $(k,z)$-clustering problem. Compared to the results for $(k,z)$-clustering in [Feldman and Langberg, STOC 2011], our framework saves an $\varepsilon^2 d$ factor in the coreset size. Compared to the results for $(k,z)$-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a $\operatorname{poly}(k)$ factor in the coreset size and avoids the $\exp(k/\varepsilon)$ term in the construction time. Specifically, our coreset for $k$-median ($z=1$) has size $\tilde{O}(\varepsilon^{-4} k)$ which, when compared to the result in [Sohler and Woodruff, FOCS 2018], saves a $k$ factor in the coreset size. Our algorithmic results rely on a new dimensionality reduction technique that connects two well-known shape-fitting problems, subspace approximation and clustering, and may be of independent interest. We also provide a size lower bound of $\Omega\left(k \cdot \min\left\{2^{z/20}, d\right\}\right)$ for a $0.01$-coreset for $(k,z)$-clustering, which has a linear dependence on $k$ and an exponential dependence on $z$ that matches our algorithmic results.
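For concreteness, the objective described above can be written as follows; the $\varepsilon$-coreset guarantee shown here is the standard one (a weighted subset whose clustering cost approximates that of the full point set for every choice of centers) and is not spelled out in the abstract itself:

$$\operatorname{cost}_z(X, C) \;=\; \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert_2^{z}, \qquad |C| = k,$$

and a weighted subset $S \subseteq X$ with weights $w$ is an $\varepsilon$-coreset if, for every set $C$ of $k$ centers,

$$\sum_{s \in S} w(s) \min_{c \in C} \lVert s - c \rVert_2^{z} \;\in\; (1 \pm \varepsilon) \cdot \operatorname{cost}_z(X, C).$$

Setting $z = 1$ recovers $k$-median and $z = 2$ recovers $k$-means.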
