
Analysis of Ward's Method

Abstract

We study Ward's method for the hierarchical $k$-means problem. This popular greedy heuristic is based on the \emph{complete linkage} paradigm: starting with all data points as singleton clusters, it successively merges two clusters to form a clustering with one fewer cluster. The pair of clusters is chosen to (locally) minimize the $k$-means cost of the clustering in the next step. Complete linkage algorithms are very popular for hierarchical clustering problems, yet their theoretical properties have been studied relatively little. For the Euclidean $k$-center problem, Ackermann et al. show that the $k$-clustering in the hierarchy computed by complete linkage has a worst-case approximation ratio of $\Theta(\log k)$. If the data lies in $\mathbb{R}^d$ for constant dimension $d$, the guarantee improves to $\mathcal{O}(1)$, but the $\mathcal{O}$-notation hides a linear dependence on $d$. Complete linkage for $k$-median or $k$-means has not been analyzed so far. In this paper, we show that Ward's method computes a $2$-approximation with respect to the $k$-means objective function if the optimal $k$-clustering is well separated. If the optimal clustering additionally satisfies a balance condition, then Ward's method fully recovers the optimum solution. These results hold in arbitrary dimension. We accompany our positive results with a lower bound of $\Omega((3/2)^d)$ for data sets in $\mathbb{R}^d$ that holds if no separation is guaranteed, and with lower bounds when the guaranteed separation is not sufficiently strong. Finally, we show that Ward's method produces an $\mathcal{O}(1)$-approximate clustering for one-dimensional data sets.
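To make the greedy merging rule concrete, the following is a minimal, naive $\mathcal{O}(n^3)$ Python sketch of the agglomerative procedure described above. It is an illustration under standard assumptions, not the authors' implementation or experimental code: the function names, the example data, and the bookkeeping are hypothetical, while the per-merge increase in $k$-means cost uses the well-known Ward criterion $\frac{|A||B|}{|A|+|B|}\,\lVert\mu_A-\mu_B\rVert^2$.

```python
import numpy as np

def ward_merge_cost(size_a, mean_a, size_b, mean_b):
    # Increase in k-means cost caused by merging clusters A and B:
    # |A||B| / (|A|+|B|) * ||mu_A - mu_B||^2  (standard Ward criterion)
    diff = mean_a - mean_b
    return size_a * size_b / (size_a + size_b) * float(np.dot(diff, diff))

def ward_hierarchy(points):
    """Naive O(n^3) sketch of Ward's method.

    Returns a list of clusterings (as lists of point-index lists), from n
    singletons down to a single cluster; the entry with k clusters is the
    k-clustering read off from the hierarchy.
    """
    pts = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(pts))]          # point indices per cluster
    means = [pts[i].copy() for i in range(len(pts))]   # cluster centroids
    sizes = [1] * len(pts)
    hierarchy = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        # Greedily pick the pair whose merge increases the k-means cost least.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_merge_cost(sizes[i], means[i], sizes[j], means[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        new_size = sizes[i] + sizes[j]
        means[i] = (sizes[i] * means[i] + sizes[j] * means[j]) / new_size
        sizes[i] = new_size
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j], means[j], sizes[j]
        hierarchy.append([c[:] for c in clusters])
    return hierarchy

# Example: two well-separated groups on the real line (hypothetical data).
data = [[0.0], [0.1], [0.2], [5.0], [5.1]]
levels = ward_hierarchy(data)
two_clustering = levels[len(data) - 2]   # the 2-clustering in the hierarchy
print(two_clustering)                    # [[0, 1, 2], [3, 4]]
```

On such well-separated inputs the greedy rule recovers the intended clusters, which matches the flavor of the positive results in the abstract; the paper's lower bounds concern inputs where no such separation is guaranteed.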
