Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning

Abstract

Scaling up Large Language Model (LLM) training involves distributing a tremendous number of training parameters across a limited number of workers. However, methods like ZeRO-3, which drastically reduce GPU memory pressure, often incur heavy communication to ensure global synchronization and consistency. Established efforts such as ZeRO++ use secondary partitions to avoid inter-node communication, since intra-node GPU-GPU transfers generally offer higher bandwidth and lower latency than inter-node connections. However, as more capable infrastructure such as Frontier, equipped with AMD GPUs, emerges with impressive computing capability, there is a need to investigate the hardware topology and develop targeted strategies to improve training efficiency. In this work, we propose a collection of communication and optimization strategies for ZeRO++ that reduce communication costs and improve memory utilization. Specifically, we propose a 3-level hierarchical partitioning for Frontier, currently the second-ranked supercomputing cluster, which leverages the different bandwidths across the layers of communication (GCD-GCD, GPU-GPU, and inter-node) to reduce communication overhead. For a 20B GPT model, we observe a 1.71x increase in TFLOPS per GPU compared with ZeRO++ and a scaling efficiency of 0.94, both at up to 384 GCDs.
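The sketch below is not the authors' implementation; it only illustrates how the three bandwidth tiers mentioned in the abstract (GCD-GCD within an MI250X, GCD-GCD within a node, and inter-node) could be mapped to hierarchical rank groups. The tier sizes (2 GCDs per MI250X, 8 GCDs per Frontier node) follow Frontier's published node layout; the function and variable names are hypothetical.

```python
# Minimal sketch of 3-level hierarchical rank grouping, assuming Frontier's
# node layout: 4 x AMD MI250X per node, each exposing 2 GCDs (8 GCDs/node).

GCDS_PER_GPU = 2    # two GCDs share one MI250X package (highest bandwidth)
GCDS_PER_NODE = 8   # 4 x MI250X per node (Infinity Fabric, medium bandwidth)


def hierarchical_groups(world_size: int):
    """Return rank groups for the three communication tiers."""
    assert world_size % GCDS_PER_NODE == 0, "world size must be a whole number of nodes"

    # Tier 1: GCD pairs sharing one MI250X package.
    intra_gpu = [list(range(r, r + GCDS_PER_GPU))
                 for r in range(0, world_size, GCDS_PER_GPU)]

    # Tier 2: all GCDs on the same node.
    intra_node = [list(range(r, r + GCDS_PER_NODE))
                  for r in range(0, world_size, GCDS_PER_NODE)]

    # Tier 3: ranks with the same local index across nodes, which would carry
    # the lowest-bandwidth inter-node traffic.
    inter_node = [list(range(local, world_size, GCDS_PER_NODE))
                  for local in range(GCDS_PER_NODE)]

    return intra_gpu, intra_node, inter_node


if __name__ == "__main__":
    g1, g2, g3 = hierarchical_groups(world_size=16)  # two hypothetical nodes
    print("intra-GPU (GCD pairs):", g1)
    print("intra-node:", g2)
    print("inter-node (by local rank):", g3)
```

In a real training setup, each list of ranks would typically back a separate communicator (for example, a process group per tier), so that parameter gathers and gradient reductions can be routed over the highest-bandwidth link available for that partition level.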

@article{xu2025_2501.04266,
  title={Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning},
  author={Lang Xu and Quentin Anthony and Jacob Hatef and Aamir Shafi and Hari Subramoni and Dhabaleswar K. Panda},
  journal={arXiv preprint arXiv:2501.04266},
  year={2025}
}