An Analysis of $D^\alpha$ seeding for $k$-means

One of the most popular clustering algorithms is the celebrated $D^\alpha$ seeding algorithm (also known as $k$-means++ when $\alpha = 2$) by Arthur and Vassilvitskii (2007), who showed that it guarantees in expectation an $O(2^{2\alpha} \cdot \log k)$-approximate solution to the $(k,\alpha)$-means cost (where Euclidean distances are raised to the power $\alpha$) for any $\alpha \ge 1$. More recently, Balcan, Dick, and White (2018) observed experimentally that using $D^\alpha$ seeding with $\alpha > 2$ can lead to a better solution with respect to the standard $k$-means objective (i.e., the $(k,2)$-means cost). In this paper, we provide a rigorous understanding of this phenomenon. For any $\alpha > 2$, we show that $D^\alpha$ seeding guarantees in expectation an approximation factor of $O_\alpha\left((g_\alpha)^{2/\alpha} \cdot \left(\sigma_{\max}/\sigma_{\min}\right)^{2-4/\alpha} \cdot (\min\{\ell, \log k\})^{2/\alpha}\right)$ with respect to the standard $k$-means cost of any underlying clustering, where $g_\alpha$ is a parameter capturing the concentration of the points in each cluster, $\sigma_{\max}$ and $\sigma_{\min}$ are the maximum and minimum standard deviation of the clusters around their means, and $\ell$ is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of $2$). We complement these results with lower bounds showing that the dependency on $g_\alpha$ and $\sigma_{\max}/\sigma_{\min}$ is tight. Finally, we provide an experimental confirmation of the effects of the aforementioned parameters when using $D^\alpha$ seeding. Further, we corroborate the observation that $\alpha > 2$ can indeed improve the $k$-means cost compared to $D^2$ seeding, and that this advantage remains even if we run Lloyd's algorithm after the seeding.
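For readers unfamiliar with the procedure, the following is a minimal sketch of $D^\alpha$ seeding, not the authors' implementation. It assumes the usual formulation: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to the distance to the nearest chosen center raised to the power $\alpha$. The function names `d_alpha_seeding`, `kmeans_cost`, and `sq_dist` are hypothetical and chosen for illustration only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d_alpha_seeding(points, k, alpha=2.0, rng=None):
    """Pick k centers from `points` by D^alpha sampling (illustrative sketch).

    alpha = 2 corresponds to k-means++ seeding; alpha > 2 puts more
    weight on points that are far from the centers chosen so far.
    """
    rng = rng or random.Random()
    # First center: uniform at random (assumed, as in k-means++).
    centers = [points[rng.randrange(len(points))]]
    # nearest[i] = squared distance from points[i] to its closest center.
    nearest = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # distance^alpha equals (squared distance)^(alpha / 2).
        weights = [d ** (alpha / 2.0) for d in nearest]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            idx = rng.choices(range(len(points)), weights=weights, k=1)[0]
        centers.append(points[idx])
        nearest = [min(d, sq_dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers


def kmeans_cost(points, centers):
    """Standard k-means cost, i.e. the (k,2)-means cost."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

Calling `d_alpha_seeding(points, k, alpha=4.0)` and then evaluating `kmeans_cost` on the result mirrors the comparison described above: the seeding uses $\alpha > 2$, while the quality is still measured by the standard $k$-means cost.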