
Order preserving hierarchical agglomerative clustering

Abstract

We present a method for hierarchical clustering of directed acyclic graphs and other strictly partially ordered data that preserves the data structure. In particular, if we have a < b in the original data and denote their respective clusters by [a] and [b], we get [a] < [b] in the produced clustering. The clustering uses standard linkage functions, such as single- and complete linkage, and is a generalisation of hierarchical clustering of non-ordered sets. To achieve this, we define the output from running hierarchical clustering algorithms on strictly ordered data to be partial dendrograms; sub-trees of classical dendrograms with several connected components. We then construct an embedding of partial dendrograms over a set into the family of ultrametrics over the same set. An optimal hierarchical clustering is now defined as follows: Given a collection of partial dendrograms, the optimal clustering is the partial dendrogram corresponding to the ultrametric closest to the original dissimilarity measure, measured in the p-norm. Thus, the method is a combination of classical hierarchical clustering and ultrametric fitting.
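
The selection step described in the abstract (choose the candidate whose ultrametric is closest to the original dissimilarity in the p-norm) can be illustrated with a minimal sketch. It assumes the candidate ultrametrics are already available as symmetric matrices; the paper's embedding of partial dendrograms into ultrametrics and the order-preservation constraint are not reproduced here. The function names, the candidate labels, and the toy data are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def p_norm_distance(d, u, p=2.0):
    """p-norm of the difference between two symmetric matrices,
    taken over all unordered pairs of distinct elements."""
    n = len(d)
    total = sum(abs(d[i][j] - u[i][j]) ** p
                for i, j in combinations(range(n), 2))
    return total ** (1.0 / p)


def is_ultrametric(u, tol=1e-12):
    """Check the strong triangle inequality u(i,k) <= max(u(i,j), u(j,k))."""
    n = len(u)
    return all(u[i][k] <= max(u[i][j], u[j][k]) + tol
               for i in range(n) for j in range(n) for k in range(n))


def closest_candidate(d, candidates, p=2.0):
    """Return the label of the candidate ultrametric nearest to d in the p-norm."""
    return min(candidates, key=lambda name: p_norm_distance(d, candidates[name], p))


# Toy dissimilarity on three elements (illustrative data, not from the paper).
d = [[0, 1, 5],
     [1, 0, 4],
     [5, 4, 0]]

# Assumed candidates: ultrametrics induced by three classical linkage functions
# on the toy data, written out by hand.
candidates = {
    "single":   [[0, 1, 4.0], [1, 0, 4.0], [4.0, 4.0, 0]],
    "complete": [[0, 1, 5.0], [1, 0, 5.0], [5.0, 5.0, 0]],
    "average":  [[0, 1, 4.5], [1, 0, 4.5], [4.5, 4.5, 0]],
}

best = closest_candidate(d, candidates, p=2.0)
```

In this toy case the choice of p matters: with p = 1 all three candidates lie at the same distance from d, while p = 2 favours the candidate that spreads the error over both pairs.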
