We design new algorithms for $k$-clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model, and are fully scalable, meaning that the local memory in each machine is $n^{\sigma}$ for arbitrarily small fixed $\sigma > 0$. Importantly, the local memory may be substantially smaller than the number of clusters $k$. Our algorithms take $O(1)$ rounds and achieve $O(1)$-bicriteria approximation for $k$-Median and for $k$-Means, namely, they compute $(1+\varepsilon)k$ clusters of cost within an $O(1/\varepsilon^2)$-factor of the optimum. Previous work achieves only $\mathrm{poly}(\log n)$-bicriteria approximation [Bhaskara et al., ICML'18], or handles a special case [Cohen-Addad et al., ICML'22]. Our results rely on an MPC algorithm for $O(1)$-approximation of facility location in $O(1)$ rounds. A primary technical tool that we develop, which may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing certain statistics on an approximate neighborhood of every data point, including range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension, and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
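As a rough illustration of the geometric-aggregation idea (a minimal sketch, not the paper's construction), the snippet below approximates range counting by hashing points into the cells of a randomly shifted uniform grid and aggregating one statistic per cell. The function names and the cell-width choice are illustrative assumptions, and the uniform grid is a simplified stand-in for the consistent-hashing / sparse-partition scheme the paper uses to handle high dimension; in MPC, each cell's bucket would reside on one machine and the per-cell statistic would be combined in a constant number of rounds.

```python
# Toy sketch of grid-based geometric aggregation (illustrative only).
import random
from collections import defaultdict

def grid_cell(point, cell_width, shift):
    # Map a point to the id of its axis-aligned grid cell after a random shift.
    return tuple(int((x + s) // cell_width) for x, s in zip(point, shift))

def approximate_range_counts(points, radius):
    """For every point, return the number of points hashed to its cell (itself included).

    With cell_width on the order of `radius`, this miscounts only neighbors
    near cell boundaries; the random shift makes that unlikely for any fixed
    pair. The actual primitive replaces the grid with consistent hashing so
    the guarantee degrades gracefully in high dimension.
    """
    dim = len(points[0])
    cell_width = 2 * radius                      # heuristic choice for this sketch
    shift = [random.uniform(0, cell_width) for _ in range(dim)]

    # "Map" phase: bucket points by cell id (one bucket per machine in MPC).
    buckets = defaultdict(int)
    for p in points:
        buckets[grid_cell(p, cell_width, shift)] += 1

    # "Reduce" phase: every point reads the statistic of its own cell.
    return [buckets[grid_cell(p, cell_width, shift)] for p in points]

if __name__ == "__main__":
    pts = [(random.random(), random.random()) for _ in range(1000)]
    print(approximate_range_counts(pts, radius=0.05)[:10])
```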