arXiv:2004.06263

Coresets for Clustering in Euclidean Spaces: Importance Sampling is Nearly Optimal

14 April 2020
Lingxiao Huang
Nisheeth K. Vishnoi
Abstract

Given a collection of $n$ points in $\mathbb{R}^d$, the goal of the $(k,z)$-clustering problem is to find a subset of $k$ "centers" that minimizes the sum of the $z$-th powers of the Euclidean distance of each point to the closest center. Special cases of the $(k,z)$-clustering problem include the $k$-median and $k$-means problems. Our main result is a unified two-stage importance sampling framework that constructs an $\varepsilon$-coreset for the $(k,z)$-clustering problem. Compared to the results for $(k,z)$-clustering in [Feldman and Langberg, STOC 2011], our framework saves an $\varepsilon^2 d$ factor in the coreset size. Compared to the results for $(k,z)$-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a $\operatorname{poly}(k)$ factor in the coreset size and avoids the $\exp(k/\varepsilon)$ term in the construction time. Specifically, our coreset for $k$-median ($z=1$) has size $\tilde{O}(\varepsilon^{-4} k)$ which, when compared to the result in [Sohler and Woodruff, FOCS 2018], saves a $k$ factor in the coreset size. Our algorithmic results rely on a new dimensionality reduction technique that connects two well-known shape fitting problems, subspace approximation and clustering, and may be of independent interest. We also provide a size lower bound of $\Omega\left(k \cdot \min\left\{2^{z/20}, d\right\}\right)$ for a $0.01$-coreset for $(k,z)$-clustering, which has a linear dependence of size on $k$ and an exponential dependence on $z$ that matches our algorithmic results.
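To make the objective and the idea of importance sampling concrete, the following is a minimal sketch and not the paper's two-stage framework. It assumes a rough set of centers is already available and uses a generic single-stage sensitivity proxy: each point's share of the rough cost plus the inverse of its cluster size. The helper names `clustering_cost` and `importance_sample` and this particular sampling distribution are illustrative assumptions. Each sampled point gets weight $1/(m\,p_i)$, so the weighted cost of the sample is an unbiased estimate of the full cost for any fixed set of centers.

```python
import math
import random


def dist(p, c):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))


def clustering_cost(points, centers, z, weights=None):
    """(k,z)-clustering cost: weighted sum of z-th powers of the distance
    from each point to its closest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(
        w * min(dist(p, c) for c in centers) ** z
        for p, w in zip(points, weights)
    )


def importance_sample(points, rough_centers, z, m, rng):
    """Hypothetical single-stage importance sampling (an assumption for
    illustration, not the authors' construction).

    Returns m sampled points and their weights 1 / (m * p_i)."""
    n, k = len(points), len(rough_centers)
    assign, costs = [], []
    for p in points:
        ds = [dist(p, c) ** z for c in rough_centers]
        j = min(range(k), key=ds.__getitem__)  # index of the nearest rough center
        assign.append(j)
        costs.append(ds[j])
    total = sum(costs)
    sizes = [0] * k
    for j in assign:
        sizes[j] += 1
    # Sensitivity proxy: cost share plus inverse cluster size.
    scores = [
        (costs[i] / total if total > 0 else 0.0) + 1.0 / sizes[assign[i]]
        for i in range(n)
    ]
    norm = sum(scores)
    probs = [s / norm for s in scores]
    idx = rng.choices(range(n), weights=probs, k=m)  # sample with replacement
    return [points[i] for i in idx], [1.0 / (m * probs[i]) for i in idx]


if __name__ == "__main__":
    rng = random.Random(0)
    # Two synthetic Gaussian clusters in the plane.
    pts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
    pts += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(500)]
    rough = [(0.0, 0.0), (10.0, 10.0)]
    sample, w = importance_sample(pts, rough, z=2, m=100, rng=rng)
    query = [(1.0, 0.0), (9.0, 10.0)]
    print("full cost:   ", clustering_cost(pts, query, 2))
    print("sampled cost:", clustering_cost(sample, query, 2, w))
```

Unbiasedness for a fixed set of centers is weaker than the $\varepsilon$-coreset guarantee, which requires the weighted cost to be within a $(1 \pm \varepsilon)$ factor of the full cost simultaneously for every choice of $k$ centers. The sample size needed for that guarantee is what the bounds in the abstract quantify.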
