Sparsely-gated Mixture-of-Experts (MoE) layers have recently been applied successfully to scale large transformers, especially for language modeling tasks. An intriguing side effect of sparse MoE layers is that they lend a model inherent interpretability through natural expert specialization. In this work, we apply sparse MoE layers to CNNs for computer vision tasks and analyze the resulting effect on model interpretability. To stabilize MoE training, we present both soft and hard constraint-based approaches. With hard constraints, the weights of certain experts are allowed to become zero, while soft constraints balance the contributions of experts with an additional auxiliary loss. As a result, soft constraints handle expert utilization better and support the expert specialization process, while hard constraints maintain more generalized experts and increase overall model performance. Our findings demonstrate that experts can implicitly focus on individual sub-domains of the input space. For example, experts trained for CIFAR-100 image classification specialize in recognizing different domains, such as flowers or animals, without prior data clustering. Experiments with RetinaNet and the COCO dataset further indicate that object detection experts can also specialize in detecting objects of distinct sizes.
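The soft-constraint idea, balancing expert contributions with an auxiliary loss, can be sketched as follows. This is a minimal, hypothetical illustration in pure Python: the function names are invented here, and the squared-coefficient-of-variation loss follows the common sparsely-gated MoE formulation, not necessarily this paper's exact one.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_gates_and_aux_loss(logit_rows, k=1, w_aux=0.01):
    """Sketch of sparse top-k gating with a load-balancing auxiliary loss.

    logit_rows: per-example raw gate scores, shape (batch, n_experts).
    Returns the sparse gate weights and w_aux * CV^2 of per-expert
    importance; the loss is small when all experts receive similar
    total gate mass, acting as a soft constraint on utilization.
    """
    n_experts = len(logit_rows[0])
    gate_rows = []
    for row in logit_rows:
        probs = softmax(row)
        # Keep only the top-k experts per example (sparse routing).
        top = sorted(range(n_experts), key=lambda j: -probs[j])[:k]
        kept = [probs[j] if j in top else 0.0 for j in range(n_experts)]
        z = sum(kept)
        gate_rows.append([v / z for v in kept])
    # importance[j] = total gate mass routed to expert j over the batch.
    importance = [sum(r[j] for r in gate_rows) for j in range(n_experts)]
    mean = sum(importance) / n_experts
    var = sum((v - mean) ** 2 for v in importance) / n_experts
    aux = w_aux * var / (mean ** 2 + 1e-9)  # squared coefficient of variation
    return gate_rows, aux
```

In training, `aux` would be added to the task loss; a hard-constraint variant would instead allow individual expert weights to be driven to zero, as described above.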