The Mixture-of-Experts (MoE) architecture is showing promising results in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: the first continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. Our gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k in the context of MTL, on both synthetic and real datasets with up to 128 tasks. Our experiments indicate that MoE models based on DSelect-k can achieve statistically significant improvements in predictive and expert selection performance. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over 22% average improvement in predictive performance compared to the Top-k gate. We provide an open-source TensorFlow implementation of our gate.
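To make the binary encoding idea concrete, the following is a minimal NumPy sketch of how a differentiable, sparse gate of this kind can be assembled. The function names (`smooth_step`, `single_expert_selector`, `dselect_k_gate`), the cubic smooth-step form, and the parameter `gamma` are illustrative assumptions for exposition, not the paper's exact TensorFlow implementation.

```python
import numpy as np

def smooth_step(t, gamma=1.0):
    """Smooth step: 0 for t <= -gamma/2, 1 for t >= gamma/2, a smooth cubic in between."""
    t = np.asarray(t, dtype=float)
    cubic = -2.0 / gamma**3 * t**3 + 3.0 / (2.0 * gamma) * t + 0.5
    return np.where(t <= -gamma / 2, 0.0, np.where(t >= gamma / 2, 1.0, cubic))

def single_expert_selector(z, n_experts):
    """Map m ~ log2(n_experts) real logits z to a distribution over experts
    by treating smooth_step(z) as soft binary digits of the expert index."""
    s = smooth_step(z)                          # soft "bits" in [0, 1]
    probs = np.ones(n_experts)
    for i in range(n_experts):
        for j in range(len(z)):
            bit = (i >> j) & 1                  # j-th bit of expert index i
            probs[i] *= s[j] if bit else 1.0 - s[j]
    return probs                                # sums to 1 by construction

def dselect_k_gate(Z, W):
    """Combine k single-expert selectors (rows of Z) with softmax weights W."""
    n_experts = 2 ** Z.shape[1]
    w = np.exp(W - W.max())
    w /= w.sum()                                # softmax over the k selectors
    return sum(w_l * single_expert_selector(z_l, n_experts) for w_l, z_l in zip(w, Z))

# Toy usage: select (at most) k = 2 out of n = 8 experts.
rng = np.random.default_rng(0)
Z = rng.normal(size=(2, 3))                     # k x log2(n) selector logits (trainable)
W = rng.normal(size=2)                          # k combination weights (trainable)
print(dselect_k_gate(Z, W))                     # convex combination; sparse once bits saturate
```

Because every operation above is continuously differentiable and each single-expert selector sums to one over the experts, the combined gate is a convex combination of expert weights that can be trained with stochastic gradient descent and becomes exactly sparse once the smooth steps saturate to 0 or 1; the number of selectors k gives explicit control over how many experts are chosen.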