Mixtures of Experts (MoE) are known for their ability to learn complex conditional distributions with multiple modes. Yet despite this potential, these models are challenging to train and often perform poorly, which explains their limited popularity. We hypothesize that this under-performance stems from the commonly used maximum likelihood (ML) optimization, which leads to mode averaging and a higher chance of getting stuck in local optima. We propose a novel curriculum-based approach to learning mixture models in which each component of the MoE selects its own subset of the training data for learning. This allows each component to be optimized independently, resulting in a more modular architecture that enables adding and deleting components on the fly and in an optimization that is less susceptible to local optima. The curricula can ignore data points from modes not represented by the MoE, reducing the mode-averaging problem. To achieve good data coverage, we couple the optimization of the curricula with a joint entropy objective and optimize a lower bound of this objective. We evaluate our curriculum-based approach on a variety of multimodal behavior learning tasks and demonstrate its superiority over competing methods for learning MoE models and conditional generative models.
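To make the per-component data selection concrete, the following is a minimal, self-contained Python/NumPy sketch of the core idea only: each mixture component maintains its own curriculum, i.e. it picks the subset of training points it fits, and the size of that subset grows under a simple pacing schedule. All names (`LinearExpert`, `frac`, `n_keep`) and the hard argmax-based point assignment are illustrative assumptions, not the paper's algorithm; in particular, the coupling of the curricula through a lower bound on the joint entropy objective (which ensures data coverage) and the on-the-fly addition and deletion of components are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data with two conditional modes: y = 2x+1 or y = -2x-1, plus noise.
N = 400
x = rng.uniform(-1.0, 1.0, size=(N, 1))
modes = rng.random((N, 1)) < 0.5
y = np.where(modes, 2.0 * x + 1.0, -2.0 * x - 1.0) + 0.05 * rng.normal(size=(N, 1))


class LinearExpert:
    """Linear-Gaussian expert fit by weighted least squares on its curriculum."""

    def __init__(self, dim, rng):
        self.w = rng.normal(size=(dim + 1, 1))
        self.sigma = 1.0

    def _features(self, x):
        return np.hstack([x, np.ones((len(x), 1))])  # append bias column

    def fit(self, x, y, weights):
        phi = self._features(x)
        W = np.diag(weights)
        ridge = 1e-6 * np.eye(phi.shape[1])  # keeps the solve well-posed
        self.w = np.linalg.solve(phi.T @ W @ phi + ridge, phi.T @ W @ y)
        resid = (y - phi @ self.w)[:, 0]
        self.sigma = np.sqrt((weights * resid ** 2).sum() / weights.sum()) + 1e-6

    def log_prob(self, x, y):
        resid = (y - self._features(x) @ self.w)[:, 0]
        return -0.5 * (resid / self.sigma) ** 2 - np.log(self.sigma) - 0.5 * np.log(2 * np.pi)


K = 2
experts = [LinearExpert(dim=1, rng=rng) for _ in range(K)]

for it in range(30):
    logp = np.stack([e.log_prob(x, y) for e in experts])  # (K, N) log-likelihoods
    best = logp.argmax(axis=0)                            # point -> best-fitting expert
    frac = min(1.0, 0.4 + 0.6 * it / 29)                  # pacing: curricula grow over time
    for k in range(K):
        assigned = np.where(best == k)[0]
        if len(assigned) == 0:
            continue  # empty curriculum; the paper would handle this via component deletion
        # Curriculum of expert k: the best-explained fraction of its assigned points.
        n_keep = max(1, int(frac * len(assigned)))
        keep = assigned[np.argsort(logp[k, assigned])[-n_keep:]]
        weights = np.zeros(N)
        weights[keep] = 1.0
        experts[k].fit(x, y, weights)  # each expert is optimized independently

# With this toy data, each expert typically specialises on one of the two modes,
# instead of averaging them as a single ML-trained regressor would.
print([np.round(e.w[:, 0], 2) for e in experts])
```

The design point the sketch is meant to illustrate is that each expert's update depends only on the points its curriculum selects, so components can be trained (and, in the full method, added or removed) independently, and points from modes no component represents can simply be left out rather than averaged over.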