Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE's sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.
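The ping-pong pipeline idea described above can be illustrated with a toy timing model. This is a minimal sketch, not the system's actual scheduler. It makes several assumptions: a fixed per-micro-batch attention time, FFN time, and one-way transfer time; one attention group and one FFN group, each processing one micro-batch at a time in FIFO order; and hypothetical names throughout (`simulate_ping_pong`, `t_attn`, `t_ffn`, `t_comm`).

```python
def simulate_ping_pong(num_layers, num_micro_batches, t_attn, t_ffn, t_comm):
    """Toy FIFO simulation of micro-batches bouncing between two GPU groups.

    All parameters are illustrative assumptions, not values from the paper.
    Returns (makespan, attention_utilization, ffn_utilization).
    """
    attn_free = 0.0  # time at which the attention group is next idle
    ffn_free = 0.0   # time at which the FFN (expert) group is next idle
    # Time at which each micro-batch's activations are back at the attention group.
    ready = [0.0] * num_micro_batches

    for _ in range(num_layers):
        for i in range(num_micro_batches):
            # Attention starts once the group is idle and the input has arrived.
            start_attn = max(attn_free, ready[i])
            attn_free = start_attn + t_attn
            # Ship activations to the FFN group, then wait for it to be idle.
            start_ffn = max(ffn_free, attn_free + t_comm)
            ffn_free = start_ffn + t_ffn
            # Ship the result back for the next layer's attention.
            ready[i] = ffn_free + t_comm

    makespan = max(ready)
    busy_attn = num_layers * num_micro_batches * t_attn
    busy_ffn = num_layers * num_micro_batches * t_ffn
    return makespan, busy_attn / makespan, busy_ffn / makespan


if __name__ == "__main__":
    for m in (1, 2, 3, 4):
        makespan, u_attn, u_ffn = simulate_ping_pong(
            num_layers=32, num_micro_batches=m, t_attn=1.0, t_ffn=1.0, t_comm=0.5
        )
        print(f"micro-batches={m} makespan={makespan:.1f} "
              f"attn_util={u_attn:.2f} ffn_util={u_ffn:.2f}")
```

In this model, a single micro-batch leaves each group idle while the other computes and while data is in flight. With several micro-batches, one group can work on the next micro-batch during those gaps, so utilization rises until the pipeline is full.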
@article{zhu2025_2504.02263,
  title={MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism},
  author={Ruidong Zhu and Ziheng Jiang and Chao Jin and Peng Wu and Cesar A. Stuardo and Dongyang Wang and Xinlei Zhang and Huaping Zhou and Haoran Wei and Yang Cheng and Jianzhe Xiao and Xinyi Zhang and Lingjun Liu and Haibin Lin and Li-Wen Chang and Jianxi Ye and Xiao Yu and Xuanzhe Liu and Xin Jin and Xin Liu},
  journal={arXiv preprint arXiv:2504.02263},
  year={2025}
}