CoSMoEs: Compact Sparse Mixture of Experts
Abstract
Sparse Mixture of Experts (MoE) models are popular foundational architectures at large scale; however, they remain under-explored at smaller sizes. Here, we show how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference. Specifically, we tackle the three main on-device dimensions: quality, memory, and latency. Along the quality axis, we show that, in a fair evaluation that removes confounding factors, MoE architectures outperform FLOP-aligned dense models at on-device scale. We introduce weight-decomposed experts, further improving MoE model performance. Regarding model memory and latency, we significantly improve model offloading efficiency and, in turn, reduce model inference latency.
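To make the architectural claims concrete, below is a minimal sketch of a sparse MoE feed-forward layer with token-level top-k routing, where each expert's weight matrices are stored as low-rank factor pairs as one possible reading of "weight-decomposed experts". This is not the paper's implementation; the class names (`LowRankExpert`, `SparseMoELayer`), the `rank` hyperparameter, and the choice of SiLU activation are illustrative assumptions.

```python
# Sketch only: sparse MoE layer with top-k routing and low-rank ("weight-decomposed")
# experts. Assumed structure, not the CoSMoEs reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankExpert(nn.Module):
    """Feed-forward expert whose weight matrices are factored into low-rank pairs (assumption)."""

    def __init__(self, d_model: int, d_ff: int, rank: int):
        super().__init__()
        # W_in (d_model x d_ff) ~ in_a @ in_b, W_out (d_ff x d_model) ~ out_a @ out_b
        self.in_a = nn.Linear(d_model, rank, bias=False)
        self.in_b = nn.Linear(rank, d_ff, bias=False)
        self.out_a = nn.Linear(d_ff, rank, bias=False)
        self.out_b = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out_b(self.out_a(F.silu(self.in_b(self.in_a(x)))))


class SparseMoELayer(nn.Module):
    """Token-level top-k routing over a small pool of compact experts."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8,
                 top_k: int = 2, rank: int = 64):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [LowRankExpert(d_model, d_ff, rank) for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                                   # (tokens, experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                      # renormalize over the top-k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = SparseMoELayer(d_model=512, d_ff=2048)
    tokens = torch.randn(16, 512)
    print(layer(tokens).shape)  # torch.Size([16, 512])
```

Because only `top_k` experts run per token, per-token FLOPs stay close to a small dense feed-forward block while total parameters scale with the number of experts; factoring each expert's weights additionally shrinks the memory that must be resident or offloaded per expert.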
@article{huber2025_2503.00245,
  title   = {CoSMoEs: Compact Sparse Mixture of Experts},
  author  = {Patrick Huber and Akshat Shrivastava and Ernie Chang and Chinnadhurai Sankar and Ahmed Aly and Adithya Sagar},
  journal = {arXiv preprint arXiv:2503.00245},
  year    = {2025}
}