Pre-trained language models have demonstrated superior performance in various natural language processing tasks. However, these models usually contain hundreds of millions of parameters, which limits their practicality because of latency requirements in real-world applications. Existing methods train small compressed models via knowledge distillation. However, the performance of these small models drops significantly compared with the pre-trained models due to their reduced model capacity. We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed. We initialize MoEBERT by adapting the feed-forward neural networks in a pre-trained model into multiple experts. As such, the representation power of the pre-trained model is largely retained. During inference, only one of the experts is activated, so that inference speed is improved. We also propose a layer-wise distillation method to train MoEBERT. We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks. Results show that the proposed method outperforms existing task-specific distillation algorithms. For example, our method outperforms previous approaches by over 2% on the MNLI (mismatched) dataset. Our code is publicly available at https://github.com/SimiaoZuo/MoEBERT.
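To make the expert-adaptation idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes the experts are formed by splitting a pre-trained FFN's intermediate dimension into equal slices and that a single expert index is supplied per forward pass; the class name `MoEFeedForward`, the even split, and the external routing are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Sketch: adapt a pre-trained FFN (d_model -> d_ff -> d_model) into
    `num_experts` smaller experts, each reusing a slice of the pre-trained
    weights so the representation power is largely retained."""

    def __init__(self, ffn_in: nn.Linear, ffn_out: nn.Linear, num_experts: int = 4):
        super().__init__()
        d_ff = ffn_in.out_features
        assert d_ff % num_experts == 0, "intermediate size must split evenly"
        slice_size = d_ff // num_experts
        self.experts_in = nn.ModuleList()
        self.experts_out = nn.ModuleList()
        for e in range(num_experts):
            lo, hi = e * slice_size, (e + 1) * slice_size
            # Each expert's input projection copies a slice of the FFN rows.
            w_in = nn.Linear(ffn_in.in_features, slice_size)
            w_in.weight.data.copy_(ffn_in.weight.data[lo:hi])
            w_in.bias.data.copy_(ffn_in.bias.data[lo:hi])
            # The output projection copies the matching columns.
            w_out = nn.Linear(slice_size, ffn_out.out_features)
            w_out.weight.data.copy_(ffn_out.weight.data[:, lo:hi])
            w_out.bias.data.copy_(ffn_out.bias.data)
            self.experts_in.append(w_in)
            self.experts_out.append(w_out)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor, expert_idx: int) -> torch.Tensor:
        # Only the selected expert runs, so FFN compute at inference drops by
        # roughly a factor of `num_experts`.
        h = self.act(self.experts_in[expert_idx](hidden_states))
        return self.experts_out[expert_idx](h)


# Usage with BERT-base-like shapes (768 hidden, 3072 intermediate).
ffn_in, ffn_out = nn.Linear(768, 3072), nn.Linear(3072, 768)
moe_ffn = MoEFeedForward(ffn_in, ffn_out, num_experts=4)
out = moe_ffn(torch.randn(2, 16, 768), expert_idx=1)  # (2, 16, 768)
```

How the single active expert is chosen per input (the routing rule) and how the layer-wise distillation loss is defined are described in the paper and the released code; they are not modeled in this sketch.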