Recently, Mixture of Experts (MoE) Transformers have garnered increasing attention due to their advantages in model capacity and computational efficiency. However, studies have shown that MoE Transformers underperform vanilla Transformers on many downstream tasks, which significantly diminishes the practical value of MoE models. To explain this issue, we propose that a model's pre-training performance and transfer capability jointly determine its downstream task performance. Compared with vanilla models, MoE models have poorer transfer capability, which leads to their subpar performance on downstream tasks. To address this, we introduce the concept of transfer capability distillation, positing that although vanilla models have weaker performance, they are effective teachers of transfer capability. MoE models guided by vanilla models can achieve both strong pre-training performance and strong transfer capability, ultimately improving their downstream task performance. We design a specific distillation method and conduct experiments on the BERT architecture. Experimental results show a significant improvement in the downstream performance of MoE models, and much further evidence strongly supports the concept of transfer capability distillation. Finally, we attempt to interpret transfer capability distillation and offer some insights from the perspective of model features.