Mixture of experts (MoE), introduced over 20 years ago, is the simplest gated modular neural network architecture. There is renewed interest in MoE because its conditional computation allows only parts of the network to be used during each inference, as was recently demonstrated in large-scale natural language processing models. MoE is also of potential interest for continual learning, as experts may be reused for new tasks and new experts introduced. The gate in the MoE architecture learns task decompositions, and the individual experts learn simpler functions appropriate to the gate's decomposition. In this paper: (1) we show that the original MoE architecture and its training method do not guarantee intuitive task decompositions and good expert utilization; indeed, they can fail spectacularly even on simple data such as MNIST and FashionMNIST; (2) we introduce a novel gating architecture, similar to attention, that improves performance and results in a lower-entropy task decomposition; and (3) we introduce a novel data-driven regularization that improves expert specialization. We empirically validate our methods on the MNIST, FashionMNIST and CIFAR-100 datasets.
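To make the architecture described above concrete, the following is a minimal sketch of a classic MoE layer: a softmax gate produces a distribution over K expert networks, and the layer output is the gate-weighted mixture of expert outputs. This is an illustrative assumption of the standard formulation, not the paper's implementation; all names, layer sizes, and the soft (rather than top-k conditional) mixing are choices made here for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExperts(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_experts: int, hidden: int = 64):
        super().__init__()
        # Each expert is a small MLP; the gate maps the input to a
        # distribution over experts (the learned "task decomposition").
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_probs = F.softmax(self.gate(x), dim=-1)                     # (batch, K)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, K, out_dim)
        # Soft mixture of all experts; conditional computation would instead
        # evaluate only the top-scoring expert(s) for each example.
        return (gate_probs.unsqueeze(-1) * expert_outs).sum(dim=1)

# Example: 10-class classification on flattened 28x28 images (e.g. MNIST).
if __name__ == "__main__":
    moe = MixtureOfExperts(in_dim=784, out_dim=10, num_experts=5)
    logits = moe(torch.randn(32, 784))
    print(logits.shape)  # torch.Size([32, 10])

In this sketch the gate's softmax weights are what the paper analyzes: low-entropy, well-balanced gate distributions correspond to intuitive task decompositions and good expert utilization.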