Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can easily scale to have outrageously large numbers of parameters without a significant increase in computational cost. However, SAMs are reported to be parameter inefficient, in that larger models do not always lead to better performance. While most ongoing research focuses on improving SAMs by exploring methods of routing inputs to experts, our analysis reveals that such research might not lead to the solution we expect: the commonly used routing methods based on gating mechanisms do not work better than randomly routing inputs to experts. In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts). Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference. THOR models are trained using a consistency regularized loss, where experts learn not only from training data but also from other experts as teachers, such that all the experts make consistent predictions. We validate the effectiveness of THOR on machine translation tasks. Results show that THOR models are more parameter efficient in that they significantly outperform the Transformer and MoE models across various settings. For example, in multilingual translation, THOR outperforms the Switch Transformer by 2 BLEU scores, and obtains the same BLEU score as a state-of-the-art MoE model that is 18 times larger. Our code is publicly available at: https://github.com/microsoft/Stochastic-Mixture-of-Experts.
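To make the two ingredients described above concrete, the following is a minimal PyTorch-style sketch of (i) an expert layer that is activated uniformly at random with no learned gate, and (ii) a consistency-regularized loss that combines cross-entropy on two stochastic forward passes with a symmetric KL term so that randomly chosen experts make consistent predictions. The names (ThorFFN, thor_consistency_loss, alpha) are illustrative assumptions and are not taken from the released code.

```python
# Illustrative sketch of THOR-style stochastic experts; not the official implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ThorFFN(nn.Module):
    """Feed-forward block whose expert is chosen uniformly at random per forward pass."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No learned gating: sample one expert at random, during both training and inference.
        idx = torch.randint(len(self.experts), (1,)).item()
        return self.experts[idx](x)


def thor_consistency_loss(logits_a, logits_b, labels, alpha=1.0):
    """Cross-entropy on two stochastic passes plus a symmetric KL consistency term,
    so each randomly activated expert also learns from the other as a teacher."""
    ce = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)
    log_p, log_q = F.log_softmax(logits_a, dim=-1), F.log_softmax(logits_b, dim=-1)
    kl = (F.kl_div(log_p, log_q.exp(), reduction="batchmean")
          + F.kl_div(log_q, log_p.exp(), reduction="batchmean"))
    return ce + alpha * kl
```

In training, each batch would be passed through the model twice so that two different random expert assignments are drawn, and the loss above is applied to the two resulting sets of logits; the hypothetical weight alpha trades off task accuracy against cross-expert consistency.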