
Scalable Training of Mixture-of-Experts Models with Megatron Core

Zijie Yan
Hongxiao Bai
Xin Yao
Dennis Liu
Tong Liu
Hongbin Liu
Pingtian Li
Evan Wu
Shiqing Fan
Li Tao
Robin Zhang
Yuzhong Wang
Shifang Xu
Jack Chang
Xuwen Chen
Kunlun Li
Yan Bai
Gao Deng
Nan Zheng
Vijay Anand Korthikanti
Abhinav Khattar
Ethan He
Soham Govande
Sangkug Lym
Zhongbo Zhu
Qi Zhang
Haochen Yuan
Xiaowei Ren
Deyu Fu
Tailai Ma
Shunkang Zhang
Jiang Shao
Ray Wang
Vasudevan Rengasamy
Rachit Garg
Santosh Bhavani
Xipeng Li
Chandler Zhou
David Wu
Yingcan Wei
Ashwath Aithal
Michael Andersch
Mohammad Shoeybi
Jiajie Yao
June Yang
Main: 83 pages
47 figures
26 tables
Bibliography: 1 page
Appendix: 4 pages
Abstract

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, total parameter count can grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry to train MoE models ranging from billions to trillions of parameters on clusters of up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
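
As a rough illustration of the sparsity argument in the abstract, the short Python sketch below shows how total expert parameters grow linearly with the number of routed experts while per-token compute depends only on the top-k experts a token activates. The sizes (hidden dimension, expert count, top-k) are hypothetical placeholders for illustration, not configurations taken from the report.

# Minimal sketch (assumed sizes, not from the report): total MoE expert
# parameters versus per-token compute under top-k routing.

hidden = 4096          # model hidden size (hypothetical)
ffn_hidden = 14336     # expert FFN hidden size (hypothetical)
num_experts = 64       # total routed experts (hypothetical)
top_k = 2              # experts activated per token (hypothetical)

# Each expert here is modeled as a 2-layer FFN: hidden -> ffn_hidden -> hidden.
params_per_expert = 2 * hidden * ffn_hidden

# Total parameters grow linearly with the number of experts ...
total_expert_params = num_experts * params_per_expert

# ... but each token only passes through top_k experts (~2 FLOPs per
# multiply-accumulate), so per-token compute is independent of num_experts.
flops_per_token = top_k * 2 * params_per_expert

print(f"total expert params: {total_expert_params / 1e9:.1f} B")
print(f"per-token expert FLOPs: {flops_per_token / 1e9:.1f} GFLOPs")

Doubling num_experts in this sketch doubles total_expert_params but leaves flops_per_token unchanged, which is the decoupling of capacity from per-token computation that creates the memory and communication pressures the report addresses.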
