Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE converts dense layers into sparse experts and uses a gated routing network to activate experts conditionally. However, as the number of experts grows, MoE with an outrageously large number of parameters suffers from overfitting and sparse data allocation. These problems are especially severe on tasks with limited data, hindering MoE models from improving performance by scaling up. In this work, we propose Mixture of Expert Clusters (MoEC), a general approach that enables expert layers to learn more diverse and appropriate knowledge by imposing variance-based constraints on the routing stage. We further propose a cluster-level expert dropout strategy specifically designed for the expert cluster structure. Our experiments show that MoEC improves performance on machine translation and natural language understanding tasks, and raises the upper bound of performance when scaling up experts under limited data. We also verify that MoEC plays a positive role in mitigating overfitting and sparse data allocation.
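To make the two ingredients of the abstract concrete, below is a minimal Python/PyTorch sketch, not the paper's implementation: it assumes experts are laid out contiguously by cluster, uses a hypothetical within-cluster variance penalty on routing probabilities as the "variance-based constraint," and drops whole clusters at training time as a stand-in for cluster-level expert dropout. Function names and the penalty form are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cluster_routing_penalty(router_logits: torch.Tensor, num_clusters: int) -> torch.Tensor:
    """Hypothetical variance-based routing constraint (a sketch, not the paper's loss).

    router_logits: [num_tokens, num_experts], with consecutive experts
    assumed to belong to the same cluster.
    """
    probs = F.softmax(router_logits, dim=-1)
    num_tokens, num_experts = probs.shape
    experts_per_cluster = num_experts // num_clusters
    per_cluster = probs.view(num_tokens, num_clusters, experts_per_cluster)
    # Penalize the variance of routing probabilities within each cluster so
    # that experts in the same cluster see similarly distributed tokens.
    return per_cluster.var(dim=-1).mean()

def cluster_dropout_mask(num_clusters: int, experts_per_cluster: int, p: float = 0.1) -> torch.Tensor:
    """Cluster-level expert dropout sketch: mask out entire clusters at once."""
    keep = (torch.rand(num_clusters) > p).float()          # [num_clusters]
    return keep.repeat_interleave(experts_per_cluster)      # [num_experts]

# Usage sketch: add the penalty to the task loss and multiply router logits
# (or expert outputs) by the mask during training.
logits = torch.randn(8, 16)                                  # 8 tokens, 16 experts
penalty = cluster_routing_penalty(logits, num_clusters=4)
mask = cluster_dropout_mask(num_clusters=4, experts_per_cluster=4)
```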