MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Topic: MoE
Main: 13 pages, 15 figures, 4 tables; Bibliography: 4 pages
Abstract
As large language models continue to scale, distributed training systems have expanded beyond 10,000 nodes, making fault tolerance increasingly important. Checkpointing has emerged as the predominant fault tolerance strategy, and extensive studies have been dedicated to optimizing its efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model presents new challenges: model size increases substantially even though computational demands remain comparable to those of dense models.
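To make the size-versus-compute gap concrete, the back-of-envelope sketch below contrasts a dense FFN sublayer with a sparse MoE sublayer. It is not from the paper; all configuration values (hidden size, expert count, top-k) are hypothetical, chosen only to illustrate why MoE checkpoint state grows with the total number of experts while per-token compute grows only with the number of activated experts.

```python
# Illustrative arithmetic (hypothetical configuration, not from the paper):
# why a sparse MoE model's checkpoint grows far faster than its compute.

d_model = 4096          # hidden size (hypothetical)
d_ff = 4 * d_model      # FFN inner size (hypothetical)
num_experts = 64        # experts per MoE layer (hypothetical)
top_k = 2               # experts activated per token (hypothetical)

# Parameter count of one FFN sublayer (two weight matrices).
ffn_params = 2 * d_model * d_ff

dense_params = ffn_params                 # dense model: one shared FFN
moe_params = num_experts * ffn_params     # MoE model: one FFN per expert

# Per-token FLOPs for the FFN sublayer (~2 FLOPs per weight used).
dense_flops = 2 * ffn_params
moe_flops = top_k * 2 * ffn_params        # only top_k experts run per token

print(f"Checkpoint state grows ~{moe_params / dense_params:.0f}x")     # ~64x
print(f"Per-token FFN compute grows ~{moe_flops / dense_flops:.0f}x")  # ~2x
```

Under these assumed settings, checkpoint state for the MoE sublayer is roughly 64x larger than the dense equivalent while per-token compute only doubles, which is the asymmetry that motivates MoE-aware fault tolerance.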
