Pre-trained language models have demonstrated superior performance in various natural language processing tasks. However, these models usually contain hundreds of millions of parameters, which limits their practicality because of latency requirements in real-world applications. Existing methods train small compressed models via knowledge distillation. However, the performance of these small models drops significantly compared with the pre-trained models due to their reduced model capacity. We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed. We initialize MoEBERT by adapting the feed-forward neural networks in a pre-trained model into multiple experts. As such, the representation power of the pre-trained model is largely retained. During inference, only one of the experts is activated, so that inference speed is improved. We also propose a layer-wise distillation method to train MoEBERT. We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks. Results show that the proposed method outperforms existing task-specific distillation algorithms. For example, our method outperforms previous approaches by over 2% on the MNLI (mismatched) dataset. Our code is publicly available at https://github.com/SimiaoZuo/MoEBERT.
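To make the expert-adaptation idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes the experts are formed by splitting a pre-trained FFN's intermediate dimension into equal slices and that a single expert index is supplied per forward pass; the class name `MoEFeedForward`, the even split, and the external routing are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Sketch: adapt a pre-trained FFN (d_model -> d_ff -> d_model) into
    `num_experts` smaller experts, each reusing a slice of the pre-trained
    weights so the representation power is largely retained."""

    def __init__(self, ffn_in: nn.Linear, ffn_out: nn.Linear, num_experts: int = 4):
        super().__init__()
        d_ff = ffn_in.out_features
        assert d_ff % num_experts == 0, "intermediate size must split evenly"
        slice_size = d_ff // num_experts
        self.experts_in = nn.ModuleList()
        self.experts_out = nn.ModuleList()
        for e in range(num_experts):
            lo, hi = e * slice_size, (e + 1) * slice_size
            # Each expert's input projection copies a slice of the FFN rows.
            w_in = nn.Linear(ffn_in.in_features, slice_size)
            w_in.weight.data.copy_(ffn_in.weight.data[lo:hi])
            w_in.bias.data.copy_(ffn_in.bias.data[lo:hi])
            # The output projection copies the matching columns.
            w_out = nn.Linear(slice_size, ffn_out.out_features)
            w_out.weight.data.copy_(ffn_out.weight.data[:, lo:hi])
            w_out.bias.data.copy_(ffn_out.bias.data)
            self.experts_in.append(w_in)
            self.experts_out.append(w_out)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor, expert_idx: int) -> torch.Tensor:
        # Only the selected expert runs, so FFN compute at inference drops by
        # roughly a factor of `num_experts`.
        h = self.act(self.experts_in[expert_idx](hidden_states))
        return self.experts_out[expert_idx](h)


# Usage with BERT-base-like shapes (768 hidden, 3072 intermediate).
ffn_in, ffn_out = nn.Linear(768, 3072), nn.Linear(3072, 768)
moe_ffn = MoEFeedForward(ffn_in, ffn_out, num_experts=4)
out = moe_ffn(torch.randn(2, 16, 768), expert_idx=1)  # (2, 16, 768)
```

How the single active expert is chosen per input (the routing rule) and how the layer-wise distillation loss is defined are described in the paper and the released code; they are not modeled in this sketch.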