Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to their large parameter capacity, but this also leads to huge computational cost. Fortunately, we find through empirical study that most inputs only activate a tiny fraction of neurons during inference. Hence, we explore accelerating large-model inference by conditional computation based on this sparse activation phenomenon. We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication. MoEfication consists of two steps: (1) splitting the parameters of feed-forward networks (FFNs) into multiple parts as experts, and (2) building expert routers to decide which experts will be used for each input. To further improve the performance of MoEfied models, we can also fine-tune them on downstream tasks, namely parameter calibration. Experimental results show that MoEfied models can significantly reduce computational cost, e.g., activating only 20% of the FFN parameters of a 700-million-parameter model without performance degradation on several downstream tasks, including text classification and reading comprehension.
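The two steps can be made concrete with a small sketch. The PyTorch code below is an illustrative re-implementation, not the authors' released code: the module name `MoEfiedFFN`, the contiguous split of the FFN's hidden neurons, and the activation-sum routing are assumptions made here for demonstration; the paper's actual splitting strategies and learned routers may differ. The sketch only emulates expert selection by masking, so it shows the mechanism rather than the wall-clock speedup, which requires computing only the selected expert blocks.

```python
import torch
import torch.nn as nn


class MoEfiedFFN(nn.Module):
    """Minimal sketch of MoEfication: split a dense FFN into expert blocks
    along the intermediate dimension and use only the top-k experts per input.
    Illustrative only; not the authors' implementation."""

    def __init__(self, ffn_in: nn.Linear, ffn_out: nn.Linear,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        d_ff = ffn_in.out_features
        assert d_ff % num_experts == 0, "d_ff must divide evenly into experts"
        self.num_experts = num_experts
        self.expert_size = d_ff // num_experts
        self.top_k = top_k
        # Step (1): reuse the original FFN parameters; each expert owns a
        # contiguous slice of the hidden neurons (a simple split for illustration).
        self.w_in = nn.Parameter(ffn_in.weight.detach().clone())    # (d_ff, d_model)
        self.b_in = nn.Parameter(ffn_in.bias.detach().clone())      # (d_ff,)
        self.w_out = nn.Parameter(ffn_out.weight.detach().clone())  # (d_model, d_ff)
        self.b_out = nn.Parameter(ffn_out.bias.detach().clone())    # (d_model,)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model)
        h = x @ self.w_in.t() + self.b_in                        # (batch, d_ff)
        h = h.view(x.size(0), self.num_experts, self.expert_size)
        # Step (2): score each expert; here, by its total positive activation.
        # A learned router would instead predict these scores from x directly,
        # so that unselected expert blocks are never computed.
        scores = h.relu().sum(dim=-1)                            # (batch, num_experts)
        topk = scores.topk(self.top_k, dim=-1).indices           # (batch, top_k)
        mask = torch.zeros_like(scores).scatter_(1, topk, 1.0)   # 1 for kept experts
        h = torch.relu(h) * mask.unsqueeze(-1)                   # drop unselected experts
        return h.view(x.size(0), -1) @ self.w_out.t() + self.b_out


# Usage sketch: wrap the two linear layers of an existing FFN block.
if __name__ == "__main__":
    d_model, d_ff = 64, 256
    moe_ffn = MoEfiedFFN(nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model))
    print(moe_ffn(torch.randn(4, d_model)).shape)  # torch.Size([4, 64])
```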