HiCCL: A Hierarchical Collective Communication Library

IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024

12 August 2024

Mert Hidayetoğlu

Simon Garcia De Gonzalo

Alex Aiken

Main: 11 Pages

13 Figures

Bibliography: 2 Pages

6 Tables

Abstract

HiCCL (Hierarchical Collective Communication Library) addresses the growing complexity and diversity in high-performance network architectures. As GPU systems have evolved into networks of GPUs with different multilevel communication hierarchies, optimizing each collective function for a specific system has become a challenging task. Consequently, many collective libraries struggle to adapt to different hardware and software, especially across systems from different vendors. HiCCL's library design decouples the collective communication logic from network-specific optimizations through a compositional API. The communication logic is composed using multicast, reduction, and fence primitives, which are then factorized for a specified network hierarchy using only point-to-point operations within a level. Finally, striping and pipelining optimizations are applied as specified to streamline execution. Performance evaluation of HiCCL across four different machines (two with Nvidia GPUs, one with AMD GPUs, and one with Intel GPUs) demonstrates an average 17$\times$ higher throughput than the collectives of highly specialized GPU-aware MPI implementations, and competitive throughput with those of vendor-specific libraries (NCCL, RCCL, and OneCCL), while providing portability across all four machines.
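
The compositional idea in the abstract can be illustrated with a toy simulation. The sketch below is not HiCCL's API: the names `Multicast`, `Reduction`, `Fence`, `run`, `all_reduce_program`, and `factorize_multicast` are hypothetical, and the two-level factorization shown (one leader per node, then fan-out inside each node) is only one plausible scheme, assumed here for illustration. It shows an all-reduce composed as a reduction, a fence, and a multicast, and a multicast rewritten as point-to-point transfers grouped by hierarchy level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Multicast:
    """Copy the buffer of rank `src` to every rank in `dsts` (hypothetical primitive)."""
    src: int
    dsts: tuple


@dataclass(frozen=True)
class Reduction:
    """Element-wise sum of the buffers of `srcs`, stored on rank `dst` (hypothetical primitive)."""
    srcs: tuple
    dst: int


@dataclass(frozen=True)
class Fence:
    """Everything registered before the fence completes before anything after it starts."""


def run(program, bufs):
    """Simulate a program on per-rank buffers.

    Primitives between two fences are treated as concurrent: they all read
    the state as it was when their stage began.
    """
    state = [list(b) for b in bufs]
    stage = []
    for op in list(program) + [Fence()]:  # trailing fence flushes the last stage
        if not isinstance(op, Fence):
            stage.append(op)
            continue
        snapshot = [list(b) for b in state]
        for prim in stage:
            if isinstance(prim, Reduction):
                columns = zip(*(snapshot[r] for r in prim.srcs))
                state[prim.dst] = [sum(col) for col in columns]
            else:
                for d in prim.dsts:
                    state[d] = list(snapshot[prim.src])
        stage = []
    return state


def all_reduce_program(num_ranks, root=0):
    """All-reduce composed as: reduce to root, fence, multicast from root."""
    ranks = tuple(range(num_ranks))
    return [Reduction(ranks, root), Fence(), Multicast(root, ranks)]


def factorize_multicast(src, num_nodes, gpus_per_node):
    """Rewrite a multicast to all ranks as point-to-point (sender, receiver)
    pairs, grouped by level of an assumed two-level hierarchy."""
    local = src % gpus_per_node
    leaders = [n * gpus_per_node + local for n in range(num_nodes)]
    inter_node = [(src, leader) for leader in leaders if leader != src]
    intra_node = [
        (leader, leader - local + i)
        for leader in leaders
        for i in range(gpus_per_node)
        if i != local
    ]
    return inter_node, intra_node


if __name__ == "__main__":
    print(run(all_reduce_program(3), [[1, 2], [3, 4], [5, 6]]))
    print(factorize_multicast(0, num_nodes=2, gpus_per_node=2))
```

In this toy model the fence is what makes the composition correct: if the reduction and the multicast shared a stage, the multicast would forward the root's unreduced buffer. Striping and pipelining are not modeled here.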
