On Efficient Low Distortion Ultrametric Embedding

A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, these methods on a data set of $n$ points in $\Omega(\log n)$ dimensions exhibit a quite prohibitive running time of $\Theta(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $\mathbb{R}^d$, and for every $c \ge 1$, runs in time $n^{1+\frac{\rho}{c^2}}$ (for some universal constant $\rho > 1$) to output an ultrametric $\Delta$ such that for any two points $u, v$ in $P$, the value $\Delta(u,v)$ is within a multiplicative factor of $5c$ of the distance between $u$ and $v$ in the "best" ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\tilde{\Delta}$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely that minimizes $\max_{u,v \in P} \frac{\tilde{\Delta}(u,v)}{\|u-v\|_2}$. We complement the above result by showing that under popular complexity theoretic assumptions, for every constant $\varepsilon > 0$, no algorithm with running time $n^{2-\varepsilon}$ can distinguish between inputs in $\ell_\infty$-metric that admit isometric embedding and those that incur a distortion of $\frac{3}{2}$. Finally, we present empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
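To make the distortion objective concrete, the following minimal sketch builds an ultrametric with a naive complete-linkage procedure and measures the maximum ratio between the ultrametric distance and the Euclidean distance over all pairs. It is not the algorithm of the paper: it is only a slow baseline of the kind the abstract compares against, and the function names (`complete_linkage_ultrametric`, `is_ultrametric`, `max_distortion`) and the sample points are assumptions chosen for illustration. The sketch also assumes all input points are distinct, so no pairwise distance is zero.

```python
import itertools
import math


def complete_linkage_ultrametric(points):
    """Naive complete linkage (illustrative, far slower than quadratic).

    Returns a matrix delta where delta[i][j] is the height at which
    points i and j are first merged into the same cluster.
    """
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    delta = [[0.0] * n for _ in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        # Pick the pair of clusters whose farthest members are closest.
        best = None
        for a, b in itertools.combinations(range(len(clusters)), 2):
            h = max(d[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or h < best[0]:
                best = (h, a, b)
        h, a, b = best
        # Every cross pair is separated at the merge height h.
        for i in clusters[a]:
            for j in clusters[b]:
                delta[i][j] = delta[j][i] = h
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: b > a, so index a is unaffected
    return delta


def is_ultrametric(delta, tol=1e-9):
    """Check the strong triangle inequality on every triple."""
    n = len(delta)
    return all(
        delta[i][k] <= max(delta[i][j], delta[j][k]) + tol
        for i in range(n) for j in range(n) for k in range(n)
    )


def max_distortion(points, delta):
    """Maximum over pairs of delta(u, v) divided by the l2 distance."""
    return max(
        delta[i][j] / math.dist(points[i], points[j])
        for i, j in itertools.combinations(range(len(points)), 2)
    )


# Hypothetical sample input: four collinear points.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
delta = complete_linkage_ultrametric(points)
print(is_ultrametric(delta), max_distortion(points, delta))
```

On this sample the script prints `True 1.75`: the worst pair is the last two points, which lie at Euclidean distance 4 but are only joined at merge height 7. Complete linkage is used here because its merge heights never fall below the original distances, so the ratio above is always at least 1.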