ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
Mixture-of-Experts (MoE) Transformers, the backbone architecture of several prominent language models, leverage sparsity by activating only a fraction of the model parameters for each input token. While this sparse structure keeps time costs constant, it is space-inefficient: all model parameters must still be loaded during inference. We introduce ResMoE, an MoE approximation framework that uses the Wasserstein barycenter to extract a common expert (the barycenter expert) and approximates the residuals between this barycenter expert and the original experts. ResMoE improves the space efficiency of inference for large-scale MoE Transformers in a one-shot, data-agnostic manner without retraining, while incurring minimal accuracy loss, thereby paving the way for broader accessibility to large language models. We demonstrate the effectiveness of ResMoE through extensive experiments on Switch Transformer, Mixtral, and DeepSeekMoE models. The results show that ResMoE can reduce the number of parameters in an expert by up to 75% while maintaining comparable performance. The code is available at this https URL.
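A minimal sketch of the idea described above, under simplifying assumptions: the paper extracts a shared barycenter expert via the Wasserstein barycenter and approximates per-expert residuals, whereas the sketch below substitutes an element-wise mean for the barycenter and a truncated-SVD low-rank factorization for the residual approximation. The function names (compress_experts, restore_expert) and the rank parameter are hypothetical and not taken from the paper's code.

import torch

def compress_experts(expert_weights, rank=32):
    """Illustrative compression: one shared 'barycenter' expert per MoE layer
    plus a low-rank residual per original expert.

    expert_weights: list of [d_out, d_in] weight matrices, one per expert.
    NOTE: the element-wise mean stands in for the Wasserstein barycenter used
    in the paper; the SVD residual is an assumed approximation scheme.
    """
    # Placeholder for the barycenter expert: element-wise mean of all experts.
    barycenter = torch.stack(expert_weights).mean(dim=0)

    residual_factors = []
    for W in expert_weights:
        # Low-rank approximation of the residual between this expert
        # and the shared barycenter expert.
        U, S, Vh = torch.linalg.svd(W - barycenter, full_matrices=False)
        residual_factors.append((U[:, :rank] * S[:rank], Vh[:rank, :]))
    return barycenter, residual_factors

def restore_expert(barycenter, residual_factors, idx):
    """Reconstruct expert `idx` at inference time from the shared barycenter
    plus its stored low-rank residual."""
    A, B = residual_factors[idx]
    return barycenter + A @ B

Storing one full barycenter matrix plus rank-r factors per expert, rather than E full expert matrices, is what yields the space savings the abstract refers to; the achievable rank-accuracy trade-off is reported in the paper's experiments.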
@article{ai2025_2503.06881,
  title   = {ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration},
  author  = {Mengting Ai and Tianxin Wei and Yifan Chen and Zhichen Zeng and Ritchie Zhao and Girish Varatkar and Bita Darvish Rouhani and Xianfeng Tang and Hanghang Tong and Jingrui He},
  journal = {arXiv preprint arXiv:2503.06881},
  year    = {2025}
}