Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism

Chenwei Cui
Rockwell Jackson
Benjamin Joseph Herrera
Ana María Tárano
Hannah Kerner
Main: 7 pages · Bibliography: 3 pages · Appendix: 2 pages · 7 figures · 7 tables
Abstract

Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts k, load imbalance affects latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism scheme that achieve O(1) communication cost regardless of k, completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to 1.61× faster with identical performance. With doubled granularity, it achieves higher overall performance while still being 1.11× faster. Our method makes multi-billion-parameter foundation model research more accessible.
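The contrast between EP's k-linear communication and HP's claimed O(1) cost can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the paper's implementation: the hidden size, byte width, and the fixed-exchange model of HP are all illustrative choices, grounded only in the abstract's scaling claims.

```python
# Toy communication-cost model (illustrative assumptions throughout).
# EP: each token is dispatched to, and gathered from, each of its k
# activated experts, so per-token all-to-all traffic scales with k.
# HP: per the abstract's O(1) claim, traffic is independent of k,
# modeled here as one fixed dispatch/combine exchange per token.

HIDDEN_DIM = 4096      # assumed token hidden size (elements)
BYTES_PER_ELEM = 2     # assumed bf16 activations

def ep_comm_bytes(k: int) -> int:
    """Expert Parallel: dispatch + combine, once per activated expert."""
    return 2 * k * HIDDEN_DIM * BYTES_PER_ELEM

def hp_comm_bytes(k: int) -> int:
    """Head Parallel (as claimed): one fixed exchange, independent of k."""
    return 2 * HIDDEN_DIM * BYTES_PER_ELEM

for k in (1, 2, 4, 8):
    print(f"k={k}: EP={ep_comm_bytes(k)} B  HP={hp_comm_bytes(k)} B")
```

Under this model, doubling the number of activated experts doubles EP traffic but leaves HP traffic unchanged, matching the scaling behavior the abstract describes.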
