Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search

Abstract

For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7 \sim 10^9$ points in high dimensions ($d \geq 100$). All current practical methods for this problem have runtimes at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
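
To make the bottleneck concrete, the following is a minimal sketch of the textbook Lloyd iteration with a brute-force reassignment step. It is not the method proposed in the paper; the function names (`assign_points`, `update_centers`, `lloyd`) are hypothetical and chosen only for illustration. With $n$ points and $k$ centers in $d$ dimensions, the reassignment step below computes all $n \cdot k$ distances in every iteration, which is the cost that approximate nearest-neighbor search is meant to reduce.

```python
def assign_points(points, centers):
    """Brute-force reassignment: compare every point against every center."""
    labels = []
    for p in points:
        best_idx, best_dist = 0, float("inf")
        for j, c in enumerate(centers):
            # Squared Euclidean distance between point p and center c.
            dist = sum((a - b) ** 2 for a, b in zip(p, c))
            if dist < best_dist:
                best_idx, best_dist = j, dist
        labels.append(best_idx)
    return labels


def update_centers(points, labels, centers):
    """Move each center to the mean of its points; empty clusters keep their center."""
    d = len(centers[0])
    sums = [[0.0] * d for _ in centers]
    counts = [0] * len(centers)
    for p, j in zip(points, labels):
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
        for j in range(len(centers))
    ]


def lloyd(points, centers, iterations=10):
    """Alternate reassignment and center updates for a fixed number of rounds."""
    for _ in range(iterations):
        labels = assign_points(points, centers)
        centers = update_centers(points, labels, centers)
    return centers, assign_points(points, centers)
```

For example, `lloyd([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` groups the first two points under one center and the last two under the other. The inner loop over `centers` in `assign_points` is the part that a nearest-neighbor index over the centers would replace.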

@article{spalding-jamieson2025_2502.06163,
  title={Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search},
  author={Jack Spalding-Jamieson and Eliot Wong Robson and Da Wei Zheng},
  journal={arXiv preprint arXiv:2502.06163},
  year={2025}
}