
WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training

Abstract

In this work, we present WLB-LLM, a workload-balanced 4D parallelism scheme for large language model training. We first thoroughly analyze the workload imbalance issue in LLM training and identify two primary sources of imbalance, at the pipeline parallelism and context parallelism levels. To address the imbalance at the pipeline parallelism level, WLB-LLM incorporates a workload-aware variable-length document packing method that balances the computation and communication workload across micro-batches. At the context parallelism level, WLB-LLM introduces a novel fine-grained per-document sharding strategy that ensures each worker within a context parallelism group has an identical workload. Comprehensive experiments at different model scales demonstrate that WLB-LLM significantly mitigates workload imbalance during 4D-parallel LLM training and achieves an average speedup of 1.23x when applied in our internal LLM training framework.
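The two mechanisms summarized above can be illustrated with a minimal sketch (not the paper's actual algorithms): a greedy longest-first packer that assigns each document to the currently least-loaded micro-batch using a quadratic attention-cost proxy, and a per-document sharding routine that splits a document into paired chunks across a context-parallelism (CP) group so that each rank receives a comparable share of causal-attention work. The function names, the cost model, and the chunk-pairing scheme are illustrative assumptions, not details taken from the paper.

# Illustrative sketch only; all names and heuristics are assumptions.
from typing import List, Tuple


def attention_cost(doc_len: int) -> int:
    """Proxy for per-document compute: causal attention scales roughly with len^2."""
    return doc_len * doc_len


def pack_documents(doc_lens: List[int], num_micro_batches: int) -> List[List[int]]:
    """Greedy longest-first packing: place each document into the micro-batch
    with the smallest accumulated cost, keeping per-micro-batch workloads close."""
    batches: List[List[int]] = [[] for _ in range(num_micro_batches)]
    loads = [0] * num_micro_batches
    for length in sorted(doc_lens, reverse=True):
        target = loads.index(min(loads))          # least-loaded micro-batch
        batches[target].append(length)
        loads[target] += attention_cost(length)
    return batches


def shard_document(doc_len: int, cp_size: int) -> List[List[Tuple[int, int]]]:
    """Per-document sharding across a CP group: split the document into
    2 * cp_size chunks and pair chunk i with chunk (2*cp_size - 1 - i), so each
    rank holds one early (cheap) and one late (expensive) causal-attention chunk."""
    chunk = (doc_len + 2 * cp_size - 1) // (2 * cp_size)
    chunks = [(i * chunk, min((i + 1) * chunk, doc_len)) for i in range(2 * cp_size)]
    return [[chunks[r], chunks[2 * cp_size - 1 - r]] for r in range(cp_size)]


if __name__ == "__main__":
    # 8 variable-length documents packed into 4 micro-batches.
    print(pack_documents([8192, 512, 4096, 1024, 2048, 256, 6144, 3072], 4))
    # One 8192-token document sharded across a CP group of size 4.
    print(shard_document(8192, 4))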

@article{wang2025_2503.17924,
  title={WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training},
  author={Zheng Wang and Anna Cai and Xinfeng Xie and Zaifeng Pan and Yue Guan and Weiwei Chu and Jie Wang and Shikai Li and Jianyu Huang and Chris Cai and Yuchen Hao and Yufei Ding},
  journal={arXiv preprint arXiv:2503.17924},
  year={2025}
}