The human education system trains one student with multiple experts. Mixture-of-experts (MoE) is a powerful sparse architecture that likewise includes multiple experts. However, a sparse MoE model is prone to overfitting, hard to deploy, and not hardware-friendly for practitioners. In this work, inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) that is as knowledgeable as one sparse MoE. We investigate this task by proposing a general training framework consisting of knowledge gathering and knowledge distillation. Specifically, to gather key knowledge from different pre-trained experts, we first investigate four possible knowledge gathering methods, \ie summation, averaging, Top-K Knowledge Gathering (Top-KG), and the Singular Value Decomposition Knowledge Gathering (SVD-KG) proposed in this paper. We then refine the dense student model by knowledge distillation to offset the noise introduced by gathering. On ImageNet, our OneS preserves $61.7\%$ of the benefits from the MoE and achieves $78.4\%$ top-1 accuracy with only $15$M parameters. On four natural language processing datasets, OneS obtains $88.2\%$ of the MoE benefits and outperforms the best baseline by $51.7\%$ using the same architecture and training data. In addition, compared with its MoE counterpart, OneS achieves a $3.7\times$ inference speedup due to less computation and a hardware-friendly architecture.
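To make the knowledge gathering step concrete, the sketch below illustrates an SVD-based gathering of expert weights in the spirit of SVD-KG: each expert's weight matrix is reduced to its leading singular components (its "key knowledge"), and the low-rank reconstructions are combined into a single dense weight. The function name, the choice of rank, and the use of averaging to combine experts are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of SVD-based knowledge gathering, assuming each expert is a
# linear (FFN) layer with weight matrix W_e of identical shape.
import torch


def svd_kg(expert_weights, rank):
    """Gather key knowledge from E expert weight matrices into one dense matrix.

    Each expert's weight is compressed to its top-`rank` singular components,
    and the low-rank reconstructions are averaged (averaging is an assumption
    made for this sketch).
    """
    gathered = torch.zeros_like(expert_weights[0])
    for W in expert_weights:
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        # Keep only the leading singular directions of this expert.
        W_low = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
        gathered += W_low
    return gathered / len(expert_weights)


# Example: merge 4 experts of a 3072x768 FFN layer, keeping rank-64 knowledge.
experts = [torch.randn(3072, 768) for _ in range(4)]
dense_weight = svd_kg(experts, rank=64)
```

In the full framework, the resulting dense weights initialize the student model, which is then refined by knowledge distillation to offset the noise introduced by gathering.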