We present a novel distributed algorithm for counting all four-node induced subgraphs in a big graph. These counts, called the 4-profile, describe a graph's connectivity properties and have found several uses ranging from bioinformatics to spam detection. We also study the more complicated problem of estimating the local 4-profiles centered at each vertex of the graph. The local 4-profile embeds every vertex in an -dimensional space that characterizes the local geometry of its neighborhood: vertices that connect different clusters will have different local 4-profiles compared to those that are only part of one dense cluster. Our algorithm is a local, distributed message-passing scheme on the graph and computes all the local 4-profiles in parallel. We rely on two novel theoretical contributions: we show that local 4-profiles can be calculated using compressed two-hop information, and we establish novel concentration results showing that graphs can be substantially sparsified and still retain good approximation quality for the global 4-profile. We empirically evaluate our algorithm using a distributed GraphLab implementation that we scaled up to cores. We show that our algorithm can compute global and local 4-profiles of graphs with millions of edges in a few minutes, significantly improving upon the previous state of the art.
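To make the definition of the global 4-profile concrete, the following is a minimal brute-force sketch in Python. It is not the distributed algorithm described above: it simply enumerates every set of four vertices and classifies the induced subgraph, which is only feasible for very small graphs. The function name `four_profile`, the table `NAMES`, and the subgraph labels are illustrative assumptions, not names taken from the paper. The classification relies on the fact that the 11 non-isomorphic graphs on four vertices have pairwise distinct sorted degree sequences.

```python
from collections import Counter
from itertools import combinations

# Hypothetical labels for the 11 four-node graphs, keyed by the sorted
# degree sequence of the induced subgraph (distinct for all 11 graphs).
NAMES = {
    (0, 0, 0, 0): "empty",
    (0, 0, 1, 1): "one edge",
    (1, 1, 1, 1): "two disjoint edges",
    (0, 1, 1, 2): "wedge plus isolated vertex",
    (0, 2, 2, 2): "triangle plus isolated vertex",
    (1, 1, 1, 3): "claw",
    (1, 1, 2, 2): "path",
    (2, 2, 2, 2): "4-cycle",
    (1, 2, 2, 3): "paw",
    (2, 2, 3, 3): "diamond",
    (3, 3, 3, 3): "clique",
}


def four_profile(vertices, edges):
    """Count induced four-node subgraphs by brute force (illustration only)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    counts = Counter({name: 0 for name in NAMES.values()})
    for quad in combinations(vertices, 4):
        members = set(quad)
        # Degree of each vertex inside the induced subgraph on `quad`.
        degrees = tuple(sorted(len(adj[v] & members) for v in quad))
        counts[NAMES[degrees]] += 1
    return counts


# Small example: a 4-cycle 0-1-2-3 with chord 0-2, plus a pendant vertex 4.
vertices = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (0, 4)]
profile = four_profile(vertices, edges)
```

On this five-vertex example the five vertex subsets yield one diamond, two paws, one claw, and one wedge with an isolated vertex. The cost grows with the number of four-vertex subsets, which is why large graphs call for the sparsification and message-passing ideas summarized in the abstract rather than direct enumeration.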