
A Nearly Tight Analysis of Greedy k-means++

ACM-SIAM Symposium on Discrete Algorithms (SODA), 2022
Abstract

The famous $k$-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] is the most popular way of solving the $k$-means problem in practice. The algorithm is very simple: it samples the first center uniformly at random, and each of the following $k-1$ centers is then sampled with probability proportional to its squared distance to the closest center so far. Afterward, Lloyd's iterative algorithm is run. The $k$-means++ algorithm is known to return a $\Theta(\log k)$-approximate solution in expectation. In their seminal work, Arthur and Vassilvitskii [SODA 2007] asked about the guarantees for the following greedy variant: in every step, we sample $\ell$ candidate centers instead of one and then pick the one that minimizes the new cost. This is also how $k$-means++ is implemented in, e.g., the popular Scikit-learn library [Pedregosa et al.; JMLR 2011]. We present nearly matching lower and upper bounds for greedy $k$-means++: we prove that it is an $O(\ell^3 \log^3 k)$-approximation algorithm. On the other hand, we prove a lower bound of $\Omega(\ell^3 \log^3 k / \log^2(\ell \log k))$. Previously, only an $\Omega(\ell \log k)$ lower bound was known [Bhattacharya, Eube, Röglin, Schmidt; ESA 2020] and there was no known upper bound.
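The greedy seeding step described in the abstract can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' code or Scikit-learn's implementation: the names `cost` and `greedy_kmeanspp_seeding`, the parameter `ell`, and the uniform fallback when every point already coincides with a center are all hypothetical choices made for this example. The subsequent Lloyd iterations are omitted.

```python
import numpy as np


def cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()


def greedy_kmeanspp_seeding(points, k, ell, rng=None):
    """Pick k centers from `points`; each step draws `ell` candidates.

    Setting ell=1 gives the plain (non-greedy) k-means++ seeding.
    """
    rng = np.random.default_rng(rng)
    n = points.shape[0]

    # First center: uniformly at random.
    first = int(rng.integers(n))
    chosen = [first]
    # min_d2[i] = squared distance from point i to its closest chosen center.
    min_d2 = ((points - points[first]) ** 2).sum(axis=1)

    for _ in range(k - 1):
        total = min_d2.sum()
        if total == 0:
            # Assumption: every point already sits on a center, so the
            # sampling distribution is undefined; fall back to uniform.
            candidates = rng.integers(n, size=ell)
        else:
            # Sample ell candidates proportional to squared distance.
            candidates = rng.choice(n, size=ell, p=min_d2 / total)

        # Squared distances from every point to each candidate: shape (ell, n).
        diffs = points[None, :, :] - points[candidates][:, None, :]
        cand_d2 = (diffs ** 2).sum(axis=2)
        # Per-point nearest-center distances if that candidate were added.
        new_min = np.minimum(min_d2[None, :], cand_d2)

        # Greedy rule: keep the candidate that yields the smallest new cost.
        best = int(np.argmin(new_min.sum(axis=1)))
        chosen.append(int(candidates[best]))
        min_d2 = new_min[best]

    return points[chosen]
```

For example, `greedy_kmeanspp_seeding(X, k=10, ell=5, rng=0)` returns a `(10, d)` array of initial centers drawn from the rows of `X`, which could then be handed to any Lloyd-style refinement.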
