
Randomized Near Neighbor Graphs, Giant Components, and Applications in Data Science

Abstract

If we pick $n$ random points uniformly in $[0,1]^d$ and connect each point to its $k$-nearest neighbors, then it is well known that there exists a giant connected component with high probability. We prove that in $[0,1]^d$ it suffices to connect every point to $c_{d,1} \log\log n$ points chosen randomly among its $c_{d,2} \log n$-nearest neighbors to ensure a giant component of size $n - o(n)$ with high probability. This construction yields a much sparser random graph, with $\sim n \log\log n$ instead of $\sim n \log n$ edges, that has comparable connectivity properties. This result has nontrivial implications for problems in data science where an affinity matrix is constructed: instead of picking the $k$-nearest neighbors, one can often pick $k' \ll k$ random points out of the $k$-nearest neighbors without sacrificing efficiency. This can massively simplify and accelerate computation; we illustrate this with several numerical examples.
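The construction described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the constants $c_{d,1}$ and $c_{d,2}$ are both set to 2 arbitrarily (the abstract does not give their values), nearest neighbors are found by brute force, and the helper names `sparse_knn_graph` and `largest_component_size` are hypothetical.

```python
import math
import random


def sparse_knn_graph(points, k, k_sub, rng):
    """Connect each point to k_sub neighbors sampled from its k nearest."""
    n = len(points)
    edges = set()
    for i, p in enumerate(points):
        # Brute-force k-nearest neighbors, excluding the point itself.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        # Keep only k_sub of the k nearest, chosen uniformly at random.
        for j in rng.sample(others[:k], k_sub):
            edges.add((min(i, j), max(i, j)))  # store as undirected edge
    return edges


def largest_component_size(n, edges):
    """Size of the largest connected component, via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    sizes = {}
    for v in range(n):
        r = find(v)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values())


rng = random.Random(0)
n, d = 400, 2
points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
# Assumed constants: c_{d,2} = 2 and c_{d,1} = 2 (illustrative only).
k = max(2, math.ceil(2 * math.log(n)))
k_sub = max(1, math.ceil(2 * math.log(math.log(n))))
edges = sparse_knn_graph(points, k, k_sub, rng)
giant = largest_component_size(n, edges)
print(f"k={k}, k'={k_sub}, edges={len(edges)}, largest component={giant}/{n}")
```

Each point contributes at most `k_sub` edges, so the graph has at most `n * k_sub` edges, compared with up to `n * k` for the full $k$-nearest-neighbor graph. Varying `n` and the two assumed constants gives a rough feel for when the largest component covers most of the points.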
