
An Analysis of $D^\alpha$ seeding for $k$-means

Abstract

One of the most popular clustering algorithms is the celebrated $D^\alpha$ seeding algorithm (also known as $k$-means++ when $\alpha=2$) by Arthur and Vassilvitskii (2007), who showed that it guarantees in expectation an $O(2^{2\alpha}\cdot \log k)$-approximate solution to the $(k,\alpha)$-means cost (where Euclidean distances are raised to the power $\alpha$) for any $\alpha\ge 1$. More recently, Balcan, Dick, and White (2018) observed experimentally that using $D^\alpha$ seeding with $\alpha>2$ can lead to a better solution with respect to the standard $k$-means objective (i.e., the $(k,2)$-means cost). In this paper, we provide a rigorous understanding of this phenomenon. For any $\alpha>2$, we show that $D^\alpha$ seeding guarantees in expectation an approximation factor of
$$ O_\alpha \left((g_\alpha)^{2/\alpha}\cdot \left(\frac{\sigma_{\mathrm{max}}}{\sigma_{\mathrm{min}}}\right)^{2-4/\alpha}\cdot \left(\min\{\ell,\log k\}\right)^{2/\alpha}\right) $$
with respect to the standard $k$-means cost of any underlying clustering, where $g_\alpha$ is a parameter capturing the concentration of the points in each cluster, $\sigma_{\mathrm{max}}$ and $\sigma_{\mathrm{min}}$ are the maximum and minimum standard deviations of the clusters around their means, and $\ell$ is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of $2$). We complement these results with lower bounds showing that the dependencies on $g_\alpha$ and $\sigma_{\mathrm{max}}/\sigma_{\mathrm{min}}$ are tight. Finally, we provide an experimental confirmation of the effects of the aforementioned parameters when using $D^\alpha$ seeding. Further, we corroborate the observation that $\alpha>2$ can indeed improve the $k$-means cost compared to $D^2$ seeding, and that this advantage remains even if we run Lloyd's algorithm after the seeding.
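To make the sampling scheme concrete, here is a minimal sketch of $D^\alpha$ seeding in NumPy (an illustrative implementation, not the authors' code): each new center is drawn with probability proportional to $D(x)^\alpha$, where $D(x)$ is the distance from $x$ to its nearest already-chosen center; $\alpha=2$ recovers the standard $k$-means++ seeding, and larger $\alpha$ biases the sampling more strongly toward far-away points.

```python
import numpy as np

def d_alpha_seeding(X, k, alpha=2.0, rng=None):
    """Pick k initial centers from the rows of X via D^alpha seeding.

    Each new center is sampled with probability proportional to
    D(x)^alpha, where D(x) is the distance from x to the nearest
    center chosen so far. alpha=2 recovers k-means++.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    # d[i] = distance from X[i] to its nearest chosen center
    d = np.linalg.norm(X - centers[0], axis=1)
    for _ in range(k - 1):
        probs = d ** alpha
        probs /= probs.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        d = np.minimum(d, np.linalg.norm(X - X[idx], axis=1))
    return np.array(centers)
```

On two well-separated point masses, for instance, the second sampled center lands in the cluster not containing the first one, since points at distance $0$ from an existing center receive zero sampling weight.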
