Mixture-of-experts (MoE) is becoming popular due to its success in improving model quality, especially in Transformers. By routing tokens with a sparse gate to a few experts, each of which contains only part of the full model, MoE keeps the model size unchanged while significantly reducing per-token computation, which effectively scales neural networks. However, we find that the current approach of jointly training the experts and the sparse gate hurts model accuracy, diminishing the efficiency of expensive large-scale model training. In this work, we propose the Dense-To-Sparse gate (DTS-Gate) for MoE training. Specifically, instead of using a permanently sparse gate, DTS-Gate begins as a dense gate that routes tokens to all experts, then gradually and adaptively becomes sparser, routing tokens to fewer experts. MoE with DTS-Gate naturally decouples the training of the experts from that of the sparse gate by training all experts first and then learning the sparse gate. Experiments show that, compared with the state-of-the-art Switch-Gate on a GPT-MoE (1.5B) model with the OpenWebText dataset (40GB), DTS-Gate obtains a 2.0x speed-up to reach the same validation perplexity, as well as higher FLOPs-efficiency with a 1.42x speed-up.
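The dense-to-sparse schedule described above can be illustrated with a small gating module. The sketch below is a minimal, hypothetical implementation assuming a softmax gate whose temperature is annealed from a high value (near-uniform, dense routing) to a low value (peaked, sparse routing); the class name, hyper-parameters, and the linear schedule are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a dense-to-sparse gate (illustrative only; the class name,
# hyper-parameters, and linear temperature schedule are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseToSparseGate(nn.Module):
    """Softmax gate whose temperature is annealed during training, so routing
    starts dense (all experts) and gradually concentrates on a few experts."""

    def __init__(self, d_model: int, num_experts: int,
                 start_temp: float = 2.0, end_temp: float = 0.1,
                 anneal_steps: int = 10_000):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.start_temp = start_temp
        self.end_temp = end_temp
        self.anneal_steps = anneal_steps
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def current_temperature(self) -> float:
        # Linear annealing from start_temp down to end_temp over anneal_steps.
        frac = min(self.step.item() / self.anneal_steps, 1.0)
        return self.start_temp + frac * (self.end_temp - self.start_temp)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model] -> gate weights: [num_tokens, num_experts]
        logits = self.w_gate(x)
        weights = F.softmax(logits / self.current_temperature(), dim=-1)
        if self.training:
            self.step += 1
        # Early in training the high temperature keeps the weights near-uniform
        # (dense routing); as the temperature decays, the distribution sharpens
        # and most of the mass falls on a few experts (sparse routing).
        return weights
```

Once the gate distribution has sharpened, only the top-weighted experts need to be evaluated per token, which recovers the computational savings of a conventional sparse MoE layer.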