The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pre-training and has achieved promising results thanks to its model capacity. However, with trillions of parameters, MoE is hard to deploy in cloud or mobile environments. MoE inference requires expert parallelism, which is neither hardware-friendly nor communication-efficient. Especially for resource-constrained downstream tasks, such a sparse structure sacrifices a great deal of computing efficiency for limited performance gains. In this work, we observe that most experts contribute very little to MoE fine-tuning and inference. We further propose a general method that progressively drops the non-professional experts for the target downstream task, which preserves the benefits of MoE while reducing the MoE model to a single-expert dense model. Our experiments reveal that the fine-tuned single-expert model preserves 99.3% of the benefits of MoE across six different types of tasks while enjoying 2x inference speed with zero communication cost.
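The abstract does not specify how "non-professional" experts are identified, so the following is only a minimal sketch of the general idea, assuming a hypothetical PyTorch-style top-1 MoE layer (`SimpleMoELayer`) and using routing frequency on downstream data as one plausible pruning criterion; it is not the authors' released implementation.

```python
# Sketch of progressive expert dropping (illustrative assumption, not the paper's code):
# track how often the router selects each expert on the downstream task, then
# repeatedly disable the least-used experts until a single dense expert remains.
import torch
import torch.nn as nn


class SimpleMoELayer(nn.Module):
    """Toy top-1 MoE layer with a mask used to disable dropped experts."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        # 1 = active expert, 0 = dropped; buffers persist in the state dict.
        self.register_buffer("active", torch.ones(num_experts))
        self.register_buffer("usage", torch.zeros(num_experts))

    def forward(self, x):  # x: (batch, d_model)
        logits = self.router(x)
        # Dropped experts can never be routed to.
        logits = logits.masked_fill(self.active == 0, float("-inf"))
        top1 = logits.argmax(dim=-1)  # (batch,)
        # Accumulate routing counts as the "professionalism" statistic.
        self.usage += torch.bincount(top1, minlength=len(self.experts)).float()
        out = torch.stack([self.experts[int(i)](x[b]) for b, i in enumerate(top1)])
        return out


def drop_least_used_experts(layer: SimpleMoELayer, keep: int) -> None:
    """Disable the least-used active experts until only `keep` remain.

    Calling this on a schedule during fine-tuning, ending with keep=1,
    leaves a single-expert dense layer for the target downstream task.
    """
    active_idx = layer.active.nonzero().flatten()
    if len(active_idx) <= keep:
        return
    order = layer.usage[active_idx].argsort()  # least-used first
    for i in order[: len(active_idx) - keep]:
        layer.active[active_idx[i]] = 0.0
```

In this reading, one would invoke `drop_least_used_experts` periodically during downstream fine-tuning (for example, halving the number of active experts every few epochs) until one expert survives, at which point the router and the all-to-all expert-parallel communication can be removed entirely.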