
Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

Main: 8 pages · Appendix: 4 pages · Bibliography: 6 pages · 8 figures · 4 tables
Abstract

Mixture-of-Experts (MoE) models typically fix the number of activated experts k at both training and inference. Intuitively, activating more experts at inference, k' (where k' > k), engages a larger set of model parameters for the computation and is therefore expected to improve performance. However, contrary to this intuition, we find the scaling range to be so narrow that performance begins to degrade rapidly after only a slight increase in the number of experts. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to scale the number of activated experts at inference without incurring additional training overhead. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across computational budgets at inference. We conduct extensive experiments on various MoE settings. Our results show that EMoE significantly expands the effective performance-scaling range, extending it to as much as 2-3× the training-time k, while also pushing the model's peak performance to a higher level.
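To make the scaling setup concrete, below is a minimal sketch (not the authors' implementation, and not the EMoE training procedure itself) of a standard top-k MoE layer in which the number of activated experts k is a per-forward-pass argument, so inference can probe a larger k' than the k used during training. The class name, layer shapes, and the softmax renormalization over the selected experts are common conventions assumed here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    """Hypothetical top-k MoE layer whose k can differ between training and inference."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k_train: int):
        super().__init__()
        self.k_train = k_train
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, k: int | None = None) -> torch.Tensor:
        # x: (num_tokens, d_model). Use the training-time k unless overridden.
        k = self.k_train if k is None else k
        logits = self.router(x)                       # (num_tokens, num_experts)
        topk_logits, topk_idx = logits.topk(k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)      # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(k):
            idx = topk_idx[:, slot]                   # expert id per token for this slot
            w = weights[:, slot].unsqueeze(-1)
            for e in idx.unique():
                mask = idx == e
                out[mask] += w[mask] * self.experts[int(e)](x[mask])
        return out


# Example: a model trained with k = 2 can be evaluated with a larger k' at inference.
layer = TopKMoELayer(d_model=64, d_hidden=256, num_experts=8, k_train=2)
tokens = torch.randn(10, 64)
y_train_k = layer(tokens)        # k = 2, as in training
y_scaled_k = layer(tokens, k=6)  # k' = 6 > k; the regime EMoE aims to make robust
```

In a vanilla top-k model, increasing k at inference as in the last line is exactly the setting where the abstract reports rapid degradation; EMoE's contribution is a training scheme that keeps this regime well-behaved.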
