Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture-of-experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvements over dense models of equivalent computational cost. LIMoE-L/16, trained comparably to CLIP-L/14, achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2% for CLIP), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods which use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts.
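To make the entropy-based regularization mentioned above concrete, the following is a minimal sketch (not the paper's actual implementation) of how auxiliary entropy losses over a router's assignment distribution could be computed for one modality. The function name `entropy_router_losses`, the toy shapes, and the loss coefficients are illustrative assumptions; the intent is only to show the general pattern of penalizing high per-token routing entropy while rewarding high entropy of the batch-averaged expert usage.

```python
import jax
import jax.numpy as jnp


def entropy(p, axis=-1, eps=1e-9):
    """Shannon entropy of (a batch of) categorical distributions."""
    return -jnp.sum(p * jnp.log(p + eps), axis=axis)


def entropy_router_losses(router_logits):
    """Hypothetical entropy-based auxiliary losses for one modality.

    router_logits: [num_tokens, num_experts] pre-softmax routing scores.
    Returns (local_entropy, neg_global_entropy); minimizing their sum
    encourages each token to route confidently (low per-token entropy)
    while spreading tokens across experts overall (high entropy of the
    batch-averaged routing distribution).
    """
    probs = jax.nn.softmax(router_logits, axis=-1)     # per-token routing distribution
    local = jnp.mean(entropy(probs))                   # low => confident routing per token
    global_probs = jnp.mean(probs, axis=0)             # marginal expert usage over the batch
    global_ent = entropy(global_probs)                 # high => balanced expert utilization
    return local, -global_ent


# Illustrative usage: the two terms (with tuned coefficients) would be added to
# the contrastive loss, computed separately for image tokens and text tokens.
key = jax.random.PRNGKey(0)
logits = jax.random.normal(key, (128, 8))              # toy shapes: 128 tokens, 8 experts
local_loss, neg_global_loss = entropy_router_losses(logits)
aux_loss = 0.01 * local_loss + 0.01 * neg_global_loss  # coefficients are placeholders
```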