Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

Abstract

Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters dominate the realm of the most capable language models. However, this massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe for harnessing such scale on Ascend NPUs. The key goals are making better use of the computing resources under dynamically sparse model structures and realizing the expected performance gains on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running expensive experiments, we leverage simulation to compare the trade-offs of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we optimize Expert Parallelism to reduce the communication and synchronization overhead between NPU devices. We also improve memory efficiency within each device to further reduce the overhead of parameter and activation management. In the end, we achieve a model FLOPs utilization (MFU) of 30.0% when training Pangu Ultra MoE on 6K Ascend NPUs, with model quality comparable to that of DeepSeek R1, and demonstrate that the Ascend system can carry all the training stages of state-of-the-art language models. Extensive experiments indicate that our recipe enables efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
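
As a rough illustration of the MFU metric quoted above, the sketch below applies the common approximation of about 6 FLOPs per activated parameter per training token (forward plus backward pass). The function name, the activated-parameter count, the token throughput, and the per-NPU peak throughput are illustrative assumptions, not values reported in the paper, so the printed figure is not expected to match the reported 30.0%.

# Minimal sketch of a model FLOPs utilization (MFU) estimate.
# All numeric inputs are illustrative placeholders, not figures reported in
# the paper, except the 6K-NPU cluster size mentioned in the abstract.

def estimate_mfu(tokens_per_second: float,
                 activated_params: float,
                 num_devices: int,
                 peak_flops_per_device: float) -> float:
    """MFU = achieved training FLOP/s divided by aggregate peak FLOP/s.

    Uses the common ~6 FLOPs per activated parameter per token rule of thumb;
    for a sparse MoE model, only the activated (routed) parameters count.
    """
    achieved_flops = 6.0 * activated_params * tokens_per_second
    peak_flops = num_devices * peak_flops_per_device
    return achieved_flops / peak_flops

if __name__ == "__main__":
    mfu = estimate_mfu(
        tokens_per_second=1.0e6,       # assumed cluster-wide token throughput
        activated_params=40e9,         # assumed activated (not total) parameters per token
        num_devices=6000,              # "6K Ascend NPUs" from the abstract
        peak_flops_per_device=4.0e14,  # assumed per-NPU peak throughput (placeholder)
    )
    print(f"Estimated MFU: {mfu:.1%}")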

@article{tang2025_2505.04519,
  title={Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs},
  author={Yehui Tang and Yichun Yin and Yaoyuan Wang and Hang Zhou and Yu Pan and Wei Guo and Ziyang Zhang and Miao Rang and Fangcheng Liu and Naifu Zhang and Binghan Li and Yonghan Dong and Xiaojun Meng and Yasheng Wang and Dong Li and Yin Li and Dandan Tu and Can Chen and Youliang Yan and Fisher Yu and Ruiming Tang and Yunhe Wang and Botian Huang and Bo Wang and Boxiao Liu and Changzheng Zhang and Da Kuang and Fei Liu and Gang Huang and Jiansheng Wei and Jiarui Qin and Jie Ran and Jinpeng Li and Jun Zhao and Liang Dai and Lin Li and Liqun Deng and Peifeng Qin and Pengyuan Zeng and Qiang Gu and Shaohua Tang and Shengjun Cheng and Tao Gao and Tao Yu and Tianshu Li and Tianyu Bi and Wei He and Weikai Mao and Wenyong Huang and Wulong Liu and Xiabing Li and Xianzhi Yu and Xueyu Wu and Xu He and Yangkai Du and Yan Xu and Ye Tian and Yimeng Wu and Yongbing Huang and Yong Tian and Yong Zhu and Yue Li and Yufei Wang and Yuhang Gai and Yujun Li and Yu Luo and Yunsheng Ni and Yusen Sun and Zelin Chen and Zhe Liu and Zhicheng Liu and Zhipeng Tu and Zilin Ding and Zongyuan Zhan},
  journal={arXiv preprint arXiv:2505.04519},
  year={2025}
}