ResearchTrend.AI
GraphGen+: Advancing Distributed Subgraph Generation and Graph Learning On Industrial Graphs

8 March 2025
Yue Jin
Yongchao Liu
Chuntao Hong
    GNN
Abstract

Graph-based computations are crucial in a wide range of applications, where graphs can scale to trillions of edges. To enable efficient training on such large graphs, mini-batch subgraph sampling is commonly used, allowing training without loading the entire graph into memory. However, existing solutions face significant trade-offs: online subgraph generation, as seen in frameworks like DGL and PyG, is limited to a single machine, resulting in severe performance bottlenecks, while offline precomputed subgraphs, as in GraphGen, improve sampling efficiency but introduce large storage overhead and high I/O costs during training. To address these challenges, we propose GraphGen+, an integrated framework that synchronizes distributed subgraph generation with in-memory graph learning, eliminating the need for external storage while significantly improving efficiency. GraphGen+ achieves a 27× speedup in subgraph generation compared to conventional SQL-like methods and a 1.3× speedup over GraphGen. It supports training on 1 million nodes per iteration and removes the overhead associated with precomputed subgraphs, making it a scalable and practical solution for industry-scale graph learning.
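The mini-batch subgraph sampling the abstract refers to draws a small neighborhood around a batch of seed nodes each iteration, so the full graph never has to fit in memory. A minimal sketch of fan-out neighbor sampling in plain Python (the toy adjacency list, seed set, and fan-out below are illustrative assumptions, not from the paper):

```python
import random

def sample_subgraph(adj, seeds, fanout, num_hops, rng=random.Random(0)):
    """Hop-by-hop neighbor sampling: from each frontier node, keep at
    most `fanout` random neighbors, expanding for `num_hops` hops."""
    nodes = set(seeds)          # nodes included in the sampled subgraph
    edges = []                  # sampled (source, neighbor) edges
    frontier = list(seeds)
    for _ in range(num_hops):
        next_frontier = []
        for u in frontier:
            nbrs = adj.get(u, [])
            for v in rng.sample(nbrs, min(fanout, len(nbrs))):
                edges.append((u, v))
                if v not in nodes:
                    nodes.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return nodes, edges

# Toy graph: node -> neighbor list
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
nodes, edges = sample_subgraph(adj, seeds=[0], fanout=2, num_hops=2)
```

Frameworks like DGL and PyG run this kind of sampler online during training; GraphGen+'s contribution is distributing this generation step and feeding the results directly into in-memory training rather than materializing them to storage.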

@article{jin2025_2503.06212,
  title={GraphGen+: Advancing Distributed Subgraph Generation and Graph Learning On Industrial Graphs},
  author={Yue Jin and Yongchao Liu and Chuntao Hong},
  journal={arXiv preprint arXiv:2503.06212},
  year={2025}
}