MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

Main: 14 pages
Bibliography: 3 pages
Appendix: 1 page
11 figures
3 tables
Abstract
We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE has emerged as a promising architecture for scaling large language models (LLMs) to unprecedented sizes and thereby improving model performance. However, existing MoE training systems suffer degraded training efficiency, a problem exacerbated by the growing scale of MoE models and the continuous evolution of hardware.
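To make the architecture concrete, below is a minimal sketch of a single MoE feed-forward layer with top-k routing, written in PyTorch purely for illustration. The layer sizes (d_model, d_ff, n_experts, top_k) are arbitrary assumptions, and the code does not reflect MegaScale-MoE's actual implementation, which is not described on this page.

# Minimal illustrative MoE feed-forward layer with top-k routing (PyTorch).
# All hyperparameters are arbitrary; this is NOT the MegaScale-MoE system.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # A learned router scores each token against every expert.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: [tokens, d_model]
        scores = self.router(x)                # [tokens, n_experts]
        weights, expert_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts, and the results are
        # combined with the routing weights, so per-token compute grows with
        # top_k rather than with the total number of experts.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = MoELayer()
    tokens = torch.randn(16, 512)              # 16 tokens of width 512
    print(layer(tokens).shape)                 # torch.Size([16, 512])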
@article{jin2025_2505.11432,
  title={MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production},
  author={Chao Jin and Ziheng Jiang and Zhihao Bai and Zheng Zhong and Juncai Liu and Xiang Li and Ningxin Zheng and Xi Wang and Cong Xie and Qi Huang and Wen Heng and Yiyuan Ma and Wenlei Bao and Size Zheng and Yanghua Peng and Haibin Lin and Xuanzhe Liu and Xin Jin and Xin Liu},
  journal={arXiv preprint arXiv:2505.11432},
  year={2025}
}