
On Efficient Low Distortion Ultrametric Embedding

Abstract

A classic problem in unsupervised learning and data analysis is to find simpler, easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, on a data set of $n$ points in $\Omega(\log n)$ dimensions, these methods exhibit a prohibitive running time of $\Theta(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $\mathbb{R}^d$, and for every $c \ge 1$, runs in time $n^{1+\frac{\rho}{c^2}}$ (for some universal constant $\rho > 1$) to output an ultrametric $\Delta$ such that for any two points $u, v$ in $P$, $\Delta(u,v)$ is within a multiplicative factor of $5c$ of the distance between $u$ and $v$ in the "best" ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $\tilde\Delta$ that minimizes the maximum distance distortion with respect to the $\ell_2$ distance, namely the one that minimizes $\max_{u,v \in P} \frac{\tilde\Delta(u,v)}{\|u-v\|_2}$. We complement this result by showing that, under popular complexity-theoretic assumptions, for every constant $\varepsilon > 0$, no algorithm with running time $n^{2-\varepsilon}$ can distinguish between inputs in the $\ell_\infty$-metric that admit an isometric embedding and those that incur a distortion of $\frac{3}{2}$. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to that of the linkage algorithms while achieving a much faster running time.
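To make the objective concrete, below is a minimal Python sketch of the quadratic-time baseline the paper compares against, not the paper's subquadratic algorithm. Single linkage yields the subdominant ultrametric (the largest ultrametric lying pointwise below the $\ell_2$ metric); a short argument shows that scaling it by $c^\star = \max_{u,v} \|u-v\|_2 / \Delta_{\mathrm{sub}}(u,v)$ gives a dominating ultrametric attaining distortion $c^\star$. The toy data and the use of SciPy's `linkage`/`cophenet` are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, cophenet

# Toy data: n points in R^d.
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 8))

d = pdist(P)                        # condensed pairwise l2 distances
Z = linkage(d, method="single")     # classic single linkage, Theta(n^2) time
delta_sub = cophenet(Z)             # cophenetic distances: the subdominant
                                    # ultrametric, lying pointwise below d

c_star = np.max(d / delta_sub)      # multiplicative distortion of the scaling
delta = c_star * delta_sub          # dominating ultrametric attaining c_star
print(f"distortion c* = {c_star:.3f}")

# Sanity check: delta satisfies the strong triangle inequality
# delta(u, v) <= max(delta(u, w), delta(w, v)) for all u, v, w.
D = squareform(delta)
assert np.all(D[:, :, None] <= np.maximum(D[:, None, :], D[None, :, :]) + 1e-9)
```

In this sketch the bottleneck is the $\Theta(n^2)$ pairwise-distance and linkage computation, which is exactly the cost the paper's $n^{1+\frac{\rho}{c^2}}$-time algorithm avoids at the price of a $5c$ approximation.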
