The Mixture-of-Experts (MoE) layer, a sparsely activated model controlled by a router, has achieved great success in deep learning. However, a rigorous understanding of this architecture remains elusive. In this paper, we formally study how the MoE layer improves the performance of neural network learning and why the mixture model does not collapse into a single model. Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the experts are pivotal to the success of MoE. To further understand this, we consider a challenging classification problem with intrinsic cluster structure, which is hard to learn with a single expert. Yet with an MoE layer whose experts are two-layer nonlinear convolutional neural networks (CNNs), we show that the problem can be learned successfully. Furthermore, our theory shows that the router can learn the cluster-center features, which helps divide the complex input problem into simpler linear classification sub-problems that individual experts can conquer. To our knowledge, this is the first result towards formally understanding the mechanism of the MoE layer for deep learning.
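To make the architecture described above concrete, the following is a minimal sketch (not the authors' code) of an MoE layer in which each expert is a two-layer nonlinear CNN and a linear router dispatches each input to its top-1 expert; all module names, the tanh activation, and the hyperparameters are illustrative assumptions rather than the paper's exact construction.

```python
# A minimal, illustrative MoE layer: a linear router over flattened inputs
# plus several two-layer nonlinear CNN experts, with hard top-1 routing.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLayerCNNExpert(nn.Module):
    """Two-layer convolutional expert: conv -> nonlinearity -> linear head."""

    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1)
        self.head = nn.Linear(hidden_channels, num_classes)

    def forward(self, x):
        h = torch.tanh(self.conv(x))   # nonlinear activation (assumed choice)
        h = h.mean(dim=(2, 3))         # global average pooling over spatial dims
        return self.head(h)


class MoELayer(nn.Module):
    """Router + experts; each input is sent to the expert with the largest gate."""

    def __init__(self, in_channels, hidden_channels, num_classes, num_experts, input_size):
        super().__init__()
        self.experts = nn.ModuleList([
            TwoLayerCNNExpert(in_channels, hidden_channels, num_classes)
            for _ in range(num_experts)
        ])
        # Linear router on the flattened input; it can learn cluster-center directions.
        self.router = nn.Linear(in_channels * input_size * input_size, num_experts)

    def forward(self, x):
        logits = self.router(x.flatten(1))          # (batch, num_experts)
        gates = F.softmax(logits, dim=-1)
        top1 = gates.argmax(dim=-1)                 # hard top-1 routing
        out = torch.zeros(x.size(0), self.experts[0].head.out_features, device=x.device)
        for k, expert in enumerate(self.experts):
            mask = top1 == k
            if mask.any():
                # Scale by the gate value so the router still receives gradients.
                out[mask] = gates[mask, k:k + 1] * expert(x[mask])
        return out


# Example usage: 4 experts over 3x32x32 inputs, binary classification.
moe = MoELayer(in_channels=3, hidden_channels=16, num_classes=2, num_experts=4, input_size=32)
scores = moe(torch.randn(8, 3, 32, 32))  # shape (8, 2)
```

The hard top-1 routing above mirrors the intuition in the abstract: the router partitions inputs by cluster, so each expert only has to solve a simpler sub-problem.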