Given a collection of points in $\mathbb{R}^d$, the goal of the $(k, z)$-clustering problem is to find a set of $k$ "centers" that minimizes the sum of the $z$-th powers of the Euclidean distances from each point to its closest center. Special cases of the $(k, z)$-clustering problem include the $k$-median and $k$-means problems. Our main result is a unified two-stage importance sampling framework that constructs an $\varepsilon$-coreset for the $(k, z)$-clustering problem. Compared to the results for $(k, z)$-clustering in [Feldman and Langberg, STOC 2011], our framework saves a factor in the coreset size. Compared to the results for $(k, z)$-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a factor in the coreset size and avoids a term in the construction time. Specifically, for $k$-median ($z = 1$), our coreset saves a factor in size when compared to the result in [Sohler and Woodruff, FOCS 2018]. Our algorithmic results rely on a new dimensionality reduction technique, which connects two well-known shape fitting problems (subspace approximation and clustering) and may be of independent interest. We also provide a size lower bound for coresets for $(k, z)$-clustering, which has a linear dependence on $k$ and an exponential dependence on $z$ that matches our algorithmic results.
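To make the objective and the role of importance sampling concrete, the following is a minimal Python sketch. It is not the two-stage framework described above: it is a generic single-stage importance sampling scheme, and it assumes a rough set of centers is already available. The function names `clustering_cost` and `importance_sample`, the 50/50 mixing of cost share and uniform probability, and the parameter `m` (sample size) are all illustrative assumptions. Here `k` is the number of centers and `z` is the exponent applied to each distance (`z = 1` for median-type costs, `z = 2` for means-type costs).

```python
import math
import random


def clustering_cost(points, centers, z, weights=None):
    """Weighted clustering cost: sum over points of w * dist(p, nearest center)^z."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(
        w * min(math.dist(p, c) for c in centers) ** z
        for p, w in zip(points, weights)
    )


def importance_sample(points, rough_centers, z, m, rng):
    """Draw m weighted points by importance sampling.

    Illustrative single-stage scheme (an assumption, not the paper's method):
    each point is sampled with probability that mixes its share of the cost
    with respect to rough_centers and a uniform term, then reweighted by
    1 / (m * probability) so the weighted cost is an unbiased estimate.
    """
    n = len(points)
    costs = [min(math.dist(p, c) for c in rough_centers) ** z for p in points]
    total = sum(costs)
    if total > 0:
        probs = [0.5 * c / total + 0.5 / n for c in costs]
    else:
        # All points coincide with a center: fall back to uniform sampling.
        probs = [1.0 / n] * n
    idx = rng.choices(range(n), weights=probs, k=m)
    sample = [points[i] for i in idx]
    weights = [1.0 / (m * probs[i]) for i in idx]
    return sample, weights


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
    centers = [(0.0, 0.0), (1.0, 1.0)]
    sample, w = importance_sample(pts, centers, z=2, m=100, rng=rng)
    print("full cost:    ", clustering_cost(pts, centers, 2))
    print("sampled cost: ", clustering_cost(sample, centers, 2, w))
```

The reweighting makes the sampled cost an unbiased estimate of the full cost for any fixed set of centers. A coreset guarantee is stronger, since it requires the estimate to be accurate for every choice of $k$ centers simultaneously.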