
Fully Scalable MPC Algorithms for Clustering in High Dimension

Abstract

We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model and are fully scalable, meaning that the local memory in each machine may be n^σ for an arbitrarily small fixed σ > 0. Importantly, the local memory may be substantially smaller than the number of clusters k, yet all our algorithms are fast, i.e., run in O(1) rounds. We first devise a fast MPC algorithm for O(1)-approximation of uniform facility location. This is the first fully scalable MPC algorithm that achieves O(1)-approximation for any clustering problem in a general geometric setting; previous algorithms only provide poly(log n)-approximation or apply to restricted inputs, such as low dimension or a small number of clusters k; see e.g. [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this facility-location result to devise a fast MPC algorithm that achieves O(1)-bicriteria approximation for k-Median and for k-Means; namely, it computes (1+ε)k clusters whose cost is within an O(1/ε²) factor of the optimum for k clusters. A primary technical tool that we introduce, which may be of independent interest, is a new MPC primitive for geometric aggregation, namely computing, for every data point, a statistic of its approximate neighborhood, for statistics such as range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension and is based on consistent hashing (a.k.a. sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
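To give a rough sense of the geometric-aggregation primitive, the sketch below uses a randomly shifted uniform grid as a simplified, illustrative stand-in for consistent hashing: each point is hashed to its grid cell, and a per-point neighborhood statistic (here, a range count) is approximated by aggregating within cells. This is not the paper's actual sparse-partition construction, which has much stronger guarantees in high dimension; the function names and the single-machine setting are assumptions for illustration only (in MPC, the per-cell aggregation would be a key-grouped reduction).

```python
import random

def make_shifted_grid_hash(cell_side, dim, seed=0):
    # Randomly shifted uniform grid (illustrative stand-in for
    # consistent hashing / sparse partition): nearby points tend
    # to land in the same cell, so per-cell aggregates serve as
    # approximate neighborhood statistics.
    rng = random.Random(seed)
    shift = [rng.uniform(0, cell_side) for _ in range(dim)]

    def h(point):
        # Map a point to the integer coordinates of its shifted cell.
        return tuple(int((x + s) // cell_side) for x, s in zip(point, shift))

    return h

def approximate_range_counts(points, cell_side):
    # Approximate, for every point, the number of points in its
    # neighborhood, by counting points per cell. In MPC this is a
    # single key-grouped sum with one key per cell.
    dim = len(points[0])
    h = make_shifted_grid_hash(cell_side, dim)
    counts = {}
    for p in points:
        counts[h(p)] = counts.get(h(p), 0) + 1
    return {tuple(p): counts[h(p)] for p in points}
```

A point far from all others (farther than a cell side along some axis) always ends up alone in its cell, while close-by points are counted together with constant probability over the random shift; the paper's consistent hashing refines this so that every small ball meets only a few cells, even in high dimension.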
