Spectral clustering based on local linear approximations

In the context of clustering, we assume a generative model where each cluster is the result of sampling points in the neighborhood of an embedded smooth surface, possibly contaminated with outliers. We consider a prototype for a higher-order spectral clustering method based on the residual from a local linear approximation. In an asymptotic setting where the number of points becomes large, we obtain theoretical guarantees for this algorithm and show that, both in terms of separation and robustness to outliers, it outperforms the standard spectral clustering algorithm of Ng, Jordan and Weiss (NIPS, 2001), which is based on pairwise distances. Under some conditions on the dimension of, and the incidence angle at, an intersection, the algorithm is able to recover the intersecting clusters. The optimal choice for some of the tuning parameters depends on the dimension and thickness of the clusters. We provide estimators that come close enough for our purposes. We discuss the cases of clusters of mixed dimensions and of clusters that are generated from smoother surfaces. We briefly discuss computational issues, arguing that computations may be restricted to a few nearest neighbors without compromising the theoretical guarantees. The resulting implementation runs in almost linear time. We include numerical experiments illustrating the theory.
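To make the notion of a residual from a local linear approximation concrete, the following is a minimal sketch and not the paper's algorithm. It assumes the residual is measured as the root-mean-square distance from a small set of points to their best-fit affine subspace, computed with an SVD. The function name `local_linear_residual`, the parameter `dim`, and the toy data are hypothetical choices made for illustration.

```python
import numpy as np


def local_linear_residual(points, dim):
    """RMS distance from the points to their best-fit affine subspace.

    Illustrative assumption: the fit is an ordinary least-squares
    (PCA-style) fit of an affine subspace of dimension `dim`.
    """
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    # Squared singular values beyond the first `dim` give the energy
    # that lies outside the best-fit subspace.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return np.sqrt(np.sum(singular_values[dim:] ** 2) / len(points))


rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=50)

# Points sampled from a single line in R^3.
line = np.column_stack([t, 2.0 * t, -t])

# Points sampled from two different lines that cross at the origin.
other = np.column_stack([t[:25], -t[:25], t[:25]])
cross = np.vstack([line[:25], other])

flat_residual = local_linear_residual(line, dim=1)
mixed_residual = local_linear_residual(cross, dim=1)
print(flat_residual, mixed_residual)
```

In this toy setting the residual is essentially zero for points drawn from one line and clearly positive for points drawn from two crossing lines, which is the kind of contrast a residual-based affinity could use to separate intersecting clusters.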