An Analysis of $D^\alpha$ seeding for $k$-means

20 October 2023

Abstract

One of the most popular clustering algorithms is the celebrated $D^\alpha$ seeding algorithm (also known as $k$-means++ when $\alpha=2$) by Arthur and Vassilvitskii (2007), who showed that it guarantees in expectation an $O(2^{2\alpha}\cdot \log k)$-approximate solution to the $(k,\alpha)$-means cost (where Euclidean distances are raised to the power $\alpha$) for any $\alpha\ge 1$. More recently, Balcan, Dick, and White (2018) observed experimentally that using $D^\alpha$ seeding with $\alpha>2$ can lead to a better solution with respect to the standard $k$-means objective (i.e., the $(k,2)$-means cost). In this paper, we provide a rigorous understanding of this phenomenon. For any $\alpha>2$, we show that $D^\alpha$ seeding guarantees in expectation an approximation factor of $O_\alpha \left((g_\alpha)^{2/\alpha}\cdot \left(\frac{\sigma_{\mathrm{max}}}{\sigma_{\mathrm{min}}}\right)^{2-4/\alpha}\cdot (\min\{\ell,\log k\})^{2/\alpha}\right)$ with respect to the standard $k$-means cost of any underlying clustering, where $g_\alpha$ is a parameter capturing the concentration of the points in each cluster, $\sigma_{\mathrm{max}}$ and $\sigma_{\mathrm{min}}$ are the maximum and minimum standard deviations of the clusters around their means, and $\ell$ is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of $2$). We complement these results with lower bounds showing that the dependency on $g_\alpha$ and $\sigma_{\mathrm{max}}/\sigma_{\mathrm{min}}$ is tight. Finally, we provide experimental confirmation of the effects of the aforementioned parameters when using $D^\alpha$ seeding. Further, we corroborate the observation that $\alpha>2$ can indeed improve the $k$-means cost compared to $D^2$ seeding, and that this advantage remains even if we run Lloyd's algorithm after the seeding.
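For orientation, the sampling rule behind $D^\alpha$ seeding can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `d_alpha_seeding` and `clustering_cost` are hypothetical, points are assumed to be tuples of floats, and the uniform fallback when all sampling weights are zero is a choice made here for robustness. The first center is drawn uniformly at random; each later center is drawn with probability proportional to the distance to the nearest already-chosen center, raised to the power $\alpha$.

```python
import math
import random


def d_alpha_seeding(points, k, alpha=2.0, seed=None):
    """Pick k centers from `points` by D^alpha sampling (alpha=2 is k-means++)."""
    rng = random.Random(seed)
    # First center: uniform over the data points.
    centers = [rng.choice(points)]
    # dist[i] = Euclidean distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        weights = [d ** alpha for d in dist]
        if sum(weights) == 0:
            # Assumption of this sketch: if every point coincides with a
            # chosen center, fall back to uniform sampling.
            new_center = rng.choice(points)
        else:
            # Sample proportionally to (distance to nearest center)^alpha.
            new_center = rng.choices(points, weights=weights, k=1)[0]
        centers.append(new_center)
        dist = [min(d, math.dist(p, new_center)) for d, p in zip(dist, points)]
    return centers


def clustering_cost(points, centers, power=2.0):
    """(k, power)-means cost; power=2 gives the standard k-means cost."""
    return sum(min(math.dist(p, c) for c in centers) ** power for p in points)
```

To mirror the comparison described in the abstract, one could seed with different values of `alpha` and always evaluate with `clustering_cost(points, centers, power=2.0)`, so that every seeding is scored against the standard $k$-means objective.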
