Compact Representation of Uncertainty in Hierarchical Clustering

Hierarchical clustering is a fundamental task often used to discover meaningful structures in data, such as phylogenetic trees, taxonomies of concepts, subtypes of cancer, and cascades of particle decays in particle physics. When multiple hierarchical clusterings of the data are possible, it is useful to represent uncertainty in the clustering through various probabilistic quantities, such as the distribution over tree structures and the marginal probabilities of subtrees. Existing approaches represent uncertainty for a range of models; however, they only provide \emph{approximate} inference. This paper presents dynamic-programming algorithms and proofs for \emph{exact} inference in hierarchical clustering at small but useful scales. We are able to compute the partition function, the MAP hierarchical clustering, and the marginal probabilities of sub-hierarchies and clusters. Our method supports a wide range of hierarchical clustering models and only requires a cluster compatibility function. Rather than scaling with the number of hierarchical clusterings of $N$ elements, $(2N-3)!!$, our approach runs in time and space proportional to the significantly smaller powerset of $N$. Although this is still large, there are many important applications within the practically computable range of $N$. We demonstrate the advantages of exact inference on synthetic data of interest to Dasgupta's cost as well as on two real-world applications, in particle physics at the Large Hadron Collider at CERN and in cancer genomics.
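To make the powerset-based dynamic program concrete, here is a minimal Python sketch of a partition-function recursion over subsets of $N$ elements. It is an illustration under stated assumptions, not the authors' implementation: the names `partition_function` and `psi` are hypothetical, clusters are encoded as bitmasks, and the potential of a hierarchy is assumed to factorize as a product of a pairwise compatibility `psi(left, right)` over its binary splits.

```python
from functools import lru_cache


def partition_function(n, psi):
    """Sum of hierarchy potentials over all binary hierarchies on n elements.

    Assumption (not taken from the paper's text): the potential of a
    hierarchy is the product of psi(left, right) over its internal splits,
    where left and right are bitmasks of the two child clusters.
    """

    @lru_cache(maxsize=None)  # one memo entry per subset, so at most 2**n
    def Z(mask):
        if mask & (mask - 1) == 0:  # a single element has one trivial tree
            return 1.0
        low = mask & -mask          # pin the lowest element to the left child
        rest = mask ^ low           # so each unordered split is seen once
        total = 0.0
        sub = (rest - 1) & rest     # largest proper submask of rest
        while True:
            left = low | sub
            right = mask ^ left     # non-empty because sub != rest
            total += psi(left, right) * Z(left) * Z(right)
            if sub == 0:
                break
            sub = (sub - 1) & rest  # next smaller submask of rest
        return total

    return Z((1 << n) - 1)


# With a constant potential of 1 the sum simply counts hierarchies,
# which for 4 elements is (2*4 - 3)!! = 5 * 3 * 1 = 15.
print(partition_function(4, lambda left, right: 1.0))  # prints 15.0
```

The memo table holds one value per subset, while the number of hierarchies it summarizes grows as $(2N-3)!!$; replacing the sum with a maximum in the same recursion would give a MAP-style variant under the same factorization assumption.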