Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Main: 9 pages · Appendix: 6 pages · Bibliography: 5 pages · 10 figures · 6 tables
Abstract
Prevailing LLM serving engines employ expert parallelism (EP) for multi-device inference of massive MoE models. However, the efficiency of expert-parallel inference is largely bounded by inter-device communication: EP relies on expensive all-to-all collectives to route tokens to remote experts whenever the experts are not collocated on the same GPU/NPU device. Moreover, state-of-the-art schemes treat expert device placement and request (or token) device scheduling as separate concerns, triggering excessive inter-device communication and compromising inference efficiency.
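To make the communication bottleneck concrete, the following minimal sketch (not from the paper; all names, placements, and routings are illustrative assumptions) counts how many routed tokens must cross devices under a given expert-to-device placement, i.e., the tokens that would incur an all-to-all hop in expert parallelism:

```python
# Hypothetical sketch: estimate cross-device token traffic in expert parallelism.
# Expert IDs, device counts, and the routing below are illustrative assumptions.

def cross_device_tokens(routing, placement, token_device):
    """Count tokens whose target expert lives on a different device.

    routing:      list of (token_id, expert_id) pairs from the MoE router.
    placement:    dict expert_id -> device_id (expert-parallel placement).
    token_device: dict token_id -> device_id holding that token's activations.
    """
    remote = 0
    for token_id, expert_id in routing:
        if placement[expert_id] != token_device[token_id]:
            remote += 1  # this token would require an all-to-all transfer
    return remote

# Toy example: 4 experts placed on 2 devices, 6 routed tokens.
placement = {0: 0, 1: 0, 2: 1, 3: 1}
token_device = {t: t % 2 for t in range(6)}  # tokens alternate between devices
routing = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 2), (5, 0)]

print(cross_device_tokens(routing, placement, token_device))  # → 2
```

Joint model-data co-scheduling, as the title suggests, would aim to choose `placement` and `token_device` together so that this count (and hence all-to-all volume) shrinks, rather than optimizing each independently.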
