Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to their large parameter capacity, but this also leads to huge computation cost. Fortunately, we observe that most inputs only activate a tiny fraction of the neurons of large Transformer-based models during inference. Hence, we propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication, which can accelerate large-model inference via conditional computation based on this sparse activation phenomenon. MoEfication consists of two steps: (1) splitting the parameters of the feed-forward networks (FFNs) into multiple parts as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that MoEfied models can significantly reduce computation cost, e.g., activating only 20% of the FFN parameters of a 700-million-parameter model without performance degradation on several downstream tasks, including text classification and machine reading comprehension.
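To make the two steps concrete, below is a minimal PyTorch sketch of a MoEfied FFN layer. The class name `MoEfiedFFN`, the even contiguous split of the FFN's hidden neurons, and the simple linear router are illustrative assumptions standing in for the paper's actual expert-construction and router-building methods; they only show how a dense FFN's existing parameters can be partitioned into experts and how a router can select a subset of them per input.

```python
import torch
import torch.nn as nn

class MoEfiedFFN(nn.Module):
    """Minimal sketch of a MoEfied FFN (assumed structure, not the paper's exact method):
    the dense FFN's hidden neurons are split evenly into experts, and a toy linear
    router picks top-k experts per token, so only their parameters are computed."""

    def __init__(self, ffn_in: nn.Linear, ffn_out: nn.Linear,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        d_model, d_hidden = ffn_in.in_features, ffn_in.out_features
        assert d_hidden % num_experts == 0
        self.expert_size = d_hidden // num_experts
        self.num_experts, self.top_k = num_experts, top_k

        # Step 1 (expert construction): reuse the original FFN parameters and
        # treat each contiguous group of hidden neurons as one expert's slice.
        self.w_in = nn.Parameter(ffn_in.weight.detach().clone())    # (d_hidden, d_model)
        self.b_in = nn.Parameter(ffn_in.bias.detach().clone())      # (d_hidden,)
        self.w_out = nn.Parameter(ffn_out.weight.detach().clone())  # (d_model, d_hidden)
        self.b_out = nn.Parameter(ffn_out.bias.detach().clone())    # (d_model,)

        # Step 2 (expert selection): a toy router scoring experts from the input.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_model)
        scores = self.router(x)                                 # (batch, num_experts)
        top_experts = scores.topk(self.top_k, dim=-1).indices   # (batch, top_k)

        out = torch.zeros_like(x) + self.b_out
        for b in range(x.size(0)):
            for e in top_experts[b].tolist():
                # Compute only the selected expert's slice of the FFN.
                s = slice(e * self.expert_size, (e + 1) * self.expert_size)
                h = torch.relu(x[b] @ self.w_in[s].T + self.b_in[s])
                out[b] = out[b] + h @ self.w_out[:, s].T
        return out

# Usage: wrap an existing dense FFN; per token, roughly top_k / num_experts of its
# FFN parameters are touched, which is the source of the claimed speedup.
ffn_in, ffn_out = nn.Linear(512, 2048), nn.Linear(2048, 512)
moe_ffn = MoEfiedFFN(ffn_in, ffn_out, num_experts=8, top_k=2)
y = moe_ffn(torch.randn(4, 512))
```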