Mixture of Experts (MoE) can effectively scale up vision transformers. However, training a large MoE transformer requires prohibitive computational resources. In this paper, we propose Residual Mixture of Experts (RMoE), an efficient training pipeline for MoE vision transformers on downstream tasks such as segmentation and detection. RMoE achieves results comparable to the upper-bound MoE training, while introducing only minor additional training cost over the lower-bound non-MoE training pipeline. The efficiency stems from our key observation: the weights of an MoE transformer can be factored into an input-independent core and an input-dependent residual. Compared with the weight core, the weight residual can be trained efficiently with far less computation, e.g., by finetuning on the downstream data. We show that, compared with the current MoE training pipeline, we obtain comparable results while saving over 30% of the training cost. Compared with state-of-the-art non-MoE transformers such as Swin-T / CvT-13 / Swin-L, we obtain +1.1 / 0.9 / 1.0 mIoU gains on ADE20K segmentation and +1.4 / 1.6 / 0.6 AP gains on MS-COCO object detection with less than 3% additional training cost.
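To make the core/residual factorization concrete, below is a minimal sketch (not the paper's implementation; all module and parameter names are illustrative) of an MoE feed-forward layer whose experts share a frozen, input-independent weight core while each expert only adds a small trainable residual. Under this assumption, only the residuals and the router receive gradients during downstream finetuning, which is where the training saving would come from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualExpertFFN(nn.Module):
    """Illustrative MoE FFN: frozen shared weight core + per-expert trainable residuals."""

    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        self.num_experts = num_experts
        # Input-independent core: one set of FFN weights shared by all experts,
        # e.g., copied from a pretrained non-MoE transformer and kept frozen.
        self.core_fc1 = nn.Parameter(torch.empty(hidden, dim), requires_grad=False)
        self.core_fc2 = nn.Parameter(torch.empty(dim, hidden), requires_grad=False)
        nn.init.xavier_uniform_(self.core_fc1)
        nn.init.xavier_uniform_(self.core_fc2)
        # Input-dependent part: one trainable residual per expert, initialized to zero
        # so the layer starts out identical to the dense (non-MoE) FFN.
        self.res_fc1 = nn.Parameter(torch.zeros(num_experts, hidden, dim))
        self.res_fc2 = nn.Parameter(torch.zeros(num_experts, dim, hidden))
        # Router that decides which expert (i.e., which residual) each token uses.
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Top-1 routing for simplicity.
        gates = F.softmax(self.router(x), dim=-1)        # (tokens, num_experts)
        expert_idx = gates.argmax(dim=-1)                # (tokens,)
        out = torch.zeros_like(x)
        for e in range(self.num_experts):
            mask = expert_idx == e
            if not mask.any():
                continue
            # Effective expert weight = frozen core + trainable residual.
            w1 = self.core_fc1 + self.res_fc1[e]
            w2 = self.core_fc2 + self.res_fc2[e]
            h = F.gelu(F.linear(x[mask], w1))
            out[mask] = gates[mask, e].unsqueeze(-1) * F.linear(h, w2)
        return out
```

In this sketch, an optimizer built over `[p for p in model.parameters() if p.requires_grad]` would update only the residuals and the router, leaving the shared core untouched.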