Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Main: 9 pages · Appendix: 6 pages · Bibliography: 5 pages · 10 figures · 6 tables
Abstract
Prevailing LLM serving engines employ expert parallelism (EP) for multi-device inference of massive MoE models. However, the efficiency of expert-parallel inference is largely bounded by inter-device communication: EP relies on expensive all-to-all collectives to route tokens to remote experts whenever the experts are not collocated on the same GPU/NPU device. Moreover, state-of-the-art schemes treat expert device placement and request (or token) device scheduling as separate concerns, triggering excessive inter-device communication and compromising inference efficiency.
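To make the communication bottleneck concrete, the following minimal sketch (not from the paper; all names, placements, and routings are illustrative assumptions) counts how many routed tokens must cross devices under a given expert-to-device placement, i.e., the tokens that would incur an all-to-all hop in expert parallelism:

```python
# Hypothetical sketch: estimate cross-device token traffic in expert parallelism.
# Expert IDs, device counts, and the routing below are illustrative assumptions.

def cross_device_tokens(routing, placement, token_device):
    """Count tokens whose target expert lives on a different device.

    routing:      list of (token_id, expert_id) pairs from the MoE router.
    placement:    dict expert_id -> device_id (expert-parallel placement).
    token_device: dict token_id -> device_id holding that token's activations.
    """
    remote = 0
    for token_id, expert_id in routing:
        if placement[expert_id] != token_device[token_id]:
            remote += 1  # this token would require an all-to-all transfer
    return remote

# Toy example: 4 experts placed on 2 devices, 6 routed tokens.
placement = {0: 0, 1: 0, 2: 1, 3: 1}
token_device = {t: t % 2 for t in range(6)}  # tokens alternate between devices
routing = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 2), (5, 0)]

print(cross_device_tokens(routing, placement, token_device))  # → 2
```

Joint model-data co-scheduling, as the title suggests, would aim to choose `placement` and `token_device` together so that this count (and hence all-to-all volume) shrinks, rather than optimizing each independently.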
