
Scalable Training of Mixture-of-Experts Models with Megatron Core

Zijie Yan
Hongxiao Bai
Xin Yao
Dennis Liu
Tong Liu
Hongbin Liu
Pingtian Li
Evan Wu
Shiqing Fan
Li Tao
Robin Zhang
Yuzhong Wang
Shifang Xu
Jack Chang
Xuwen Chen
Kunlun Li
Yan Bai
Gao Deng
Nan Zheng
Vijay Anand Korthikanti
Abhinav Khattar
Ethan He
Soham Govande
Sangkug Lym
Zhongbo Zhu
Qi Zhang
Haochen Yuan
Xiaowei Ren
Deyu Fu
Tailai Ma
Shunkang Zhang
Jiang Shao
Ray Wang
Vasudevan Rengasamy
Rachit Garg
Santosh Bhavani
Xipeng Li
Chandler Zhou
David Wu
Yingcan Wei
Ashwath Aithal
Michael Andersch
Mohammad Shoeybi
Jiajie Yao
June Yang
Main: 83 pages
47 figures
26 tables
Bibliography: 1 page
Appendix: 4 pages
Abstract

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, total parameter count can grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry to train MoE models ranging from billions to trillions of parameters on clusters of up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
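
As a rough illustration of the sparsity argument in the abstract, the short Python sketch below shows how total expert parameters grow linearly with the number of routed experts while per-token compute depends only on the top-k experts a token activates. The sizes (hidden dimension, expert count, top-k) are hypothetical placeholders for illustration, not configurations taken from the report.

# Minimal sketch (assumed sizes, not from the report): total MoE expert
# parameters versus per-token compute under top-k routing.

hidden = 4096          # model hidden size (hypothetical)
ffn_hidden = 14336     # expert FFN hidden size (hypothetical)
num_experts = 64       # total routed experts (hypothetical)
top_k = 2              # experts activated per token (hypothetical)

# Each expert here is modeled as a 2-layer FFN: hidden -> ffn_hidden -> hidden.
params_per_expert = 2 * hidden * ffn_hidden

# Total parameters grow linearly with the number of experts ...
total_expert_params = num_experts * params_per_expert

# ... but each token only passes through top_k experts (~2 FLOPs per
# multiply-accumulate), so per-token compute is independent of num_experts.
flops_per_token = top_k * 2 * params_per_expert

print(f"total expert params: {total_expert_params / 1e9:.1f} B")
print(f"per-token expert FLOPs: {flops_per_token / 1e9:.1f} GFLOPs")

Doubling num_experts in this sketch doubles total_expert_params but leaves flops_per_token unchanged, which is the decoupling of capacity from per-token computation that creates the memory and communication pressures the report addresses.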
