One Student Knows All Experts Know: From Sparse to Dense
The human education system trains one student with multiple experts. Mixture-of-experts (MoE) is a powerful sparse architecture that contains multiple experts. However, a sparse MoE model is hard to implement, prone to overfitting, and not hardware-friendly. In this work, inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) that is as knowledgeable as a sparse MoE. We investigate this task with a general training framework consisting of knowledge gathering and knowledge distillation. Specifically, we first propose Singular Value Decomposition Knowledge Gathering (SVD-KG) to gather key knowledge from the different pretrained experts. We then refine the dense student model by knowledge distillation to offset the noise introduced by gathering. On ImageNet, OneS preserves the benefits of MoE and can achieve top-1 accuracy with only M parameters. On four natural language processing datasets, OneS obtains the MoE benefits and outperforms the SoTA while using the same architecture and training data. In addition, compared with its MoE counterpart, OneS achieves an inference speedup due to its hardware-friendly dense architecture.
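To make the knowledge-gathering idea concrete, below is a minimal NumPy sketch of one plausible SVD-based gathering step: each pretrained expert's weight matrix is compressed to its top-k singular components, and the low-rank reconstructions are averaged into a single dense student weight. The function name `svd_gather`, the rank `k`, and the averaging rule are illustrative assumptions; the exact aggregation used by SVD-KG in the paper may differ, and the subsequent knowledge-distillation refinement is not shown.

```python
import numpy as np

def svd_gather(expert_weights, k):
    """Illustrative sketch (not the paper's exact SVD-KG):
    keep the top-k singular components of each expert weight
    matrix and average the low-rank reconstructions into one
    dense student weight."""
    gathered = np.zeros_like(expert_weights[0])
    for W in expert_weights:
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        # Retain only the k largest singular values, treated here
        # as the expert's "key knowledge".
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]
        gathered += low_rank
    return gathered / len(expert_weights)

# Example: merge four hypothetical 512x2048 expert FFN weights
# into one dense weight of the same shape.
experts = [np.random.randn(512, 2048) for _ in range(4)]
student_weight = svd_gather(experts, k=64)
print(student_weight.shape)  # (512, 2048)
```

In this sketch the student keeps the experts' dense shape, which is what makes the resulting model hardware-friendly at inference time; distillation would then be used to recover accuracy lost to the low-rank truncation.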
View on arXiv