Sparsely activated transformers, such as Mixture of Experts (MoE), have received great interest due to their outrageous scaling capability, which enables dramatic increases in model size without significant increases in computational cost. To achieve this, MoE models replace the feed-forward sub-layer in transformers with a Mixture-of-Experts sub-layer and use a gating network to route each token to its assigned experts. Since the common practice for efficient training of such models requires distributing experts and tokens across different machines, this routing strategy often incurs significant cross-machine communication cost, because tokens and their assigned experts are likely to reside on different machines. In this paper, we propose \emph{Gating Dropout}, which allows tokens to ignore the gating network and stay on their local machines, thus reducing cross-machine communication. Similar to traditional dropout, we also show that Gating Dropout has a regularization effect during training, resulting in improved generalization performance. We validate the effectiveness of Gating Dropout on multilingual machine translation tasks. Our results demonstrate that Gating Dropout improves a state-of-the-art MoE model, with faster wall-clock convergence and better BLEU scores across a variety of model sizes and datasets.
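To make the routing idea concrete, below is a minimal, single-machine sketch of a top-1 gated MoE layer with a gating-dropout step. This is not the paper's implementation: the class name `GatingDropoutMoE`, the drop probability `p`, and the fixed `local_expert` index (standing in for the expert co-located on a token's machine) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatingDropoutMoE(nn.Module):
    """Sketch of a top-1 MoE layer with gating dropout (hypothetical).

    During training, each token skips the gating network with probability p
    and is handled by a fixed "local" expert, mimicking the idea of staying
    on the local machine and avoiding the all-to-all dispatch.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int,
                 p: float = 0.5, local_expert: int = 0):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(d_model, num_experts)
        self.p = p                      # gating-dropout probability (assumed value)
        self.local_expert = local_expert

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)   # gating probabilities
        top1 = scores.argmax(dim=-1)               # expert chosen by the gate

        if self.training:
            # Gating dropout: with probability p, a token ignores the gate
            # and stays with its local expert (no cross-machine routing).
            drop = torch.rand(x.size(0), device=x.device) < self.p
            top1 = torch.where(drop, torch.full_like(top1, self.local_expert), top1)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```

In a real distributed setting the dropped tokens would simply bypass the all-to-all communication step rather than being re-indexed to one expert; the sketch above only illustrates the stochastic bypass of the gate that gives both the communication saving and the dropout-like regularization described in the abstract.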