Sparse mixture of experts provides larger model capacity while keeping computational cost constant. It employs a routing mechanism that distributes input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis of the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than baseline mixture-of-experts methods.
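The abstract does not spell out the scoring function, so the following is a minimal PyTorch sketch of one plausible reading: token hidden states are projected to a low dimension, and both the projections and the expert embeddings are L2-normalized, so routing scores become cosine similarities on a hypersphere. The class name `HypersphereRouter`, the routing dimension of 8, and the learnable temperature are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphereRouter(nn.Module):
    """Hypothetical sketch: score tokens against experts on a
    low-dimensional hypersphere via normalized dot products."""

    def __init__(self, hidden_dim: int, num_experts: int,
                 routing_dim: int = 8, init_tau: float = 0.07):
        super().__init__()
        # Dimension reduction from the model's hidden size (assumed choice).
        self.proj = nn.Linear(hidden_dim, routing_dim, bias=False)
        # One learnable embedding per expert, in the routing space.
        self.expert_emb = nn.Parameter(torch.randn(num_experts, routing_dim))
        # Learnable softmax temperature (an assumption for this sketch).
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_tau)))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_dim) -> probs: (batch, seq, num_experts)
        t = F.normalize(self.proj(hidden), dim=-1)  # unit-norm token reps
        e = F.normalize(self.expert_emb, dim=-1)    # unit-norm expert embeddings
        scores = (t @ e.t()) / self.log_tau.exp()   # cosine similarity / temperature
        return F.softmax(scores, dim=-1)

# Usage: top-1 gating picks one expert per token.
router = HypersphereRouter(hidden_dim=768, num_experts=32)
probs = router(torch.randn(2, 16, 768))
top1_expert = probs.argmax(dim=-1)
```

Because both sides are unit-normalized, routing scores are bounded in [-1, 1] regardless of hidden-state norms, which is one way such a design could discourage tokens from collapsing toward expert centroids.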