Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, their bulky size and slow inference speed make them hard to deploy on edge devices. In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT yields a higher gain than improving the multi-head attention (MHA), since the computational cost of the FFN is 2$\sim$3 times larger than that of MHA. Hence, to compact BERT, we focus on designing an efficient FFN, in contrast to previous works that concentrate on MHA. Since the FFN comprises a multilayer perceptron (MLP) that is essential to BERT optimization, we further design a thorough search space over advanced MLP structures and apply a coarse-to-fine mechanism to search for an efficient BERT architecture. Moreover, to accelerate the search and enhance model transferability, we employ a novel warm-up knowledge distillation strategy at each search stage. Extensive experiments show that our searched EfficientBERT is 6.9$\times$ smaller and 4.4$\times$ faster than BERT$\rm_{BASE}$, and achieves competitive performance on the GLUE and SQuAD benchmarks. Concretely, EfficientBERT attains a 77.7 average score on the GLUE \emph{test} set, 0.7 higher than MobileBERT$\rm_{TINY}$, and achieves an 85.3/74.5 F1 score on the SQuAD v1.1/v2.0 \emph{dev} sets, 3.2/2.7 higher than TinyBERT$_4$ even without data augmentation. The code is released at https://github.com/cheneydon/efficient-bert.
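As a rough sanity check of the FFN-versus-MHA cost comparison stated above, the following sketch (not from the paper; the FLOP-counting convention and the sequence length of 128 are assumptions) estimates per-layer FLOPs for a BERT$\rm_{BASE}$-sized layer:

```python
# Illustrative sketch: per-layer FLOP estimates for the MHA and FFN sub-layers
# of a BERT_BASE-sized Transformer layer (hidden size 768, FFN inner size 3072).
# Counting conventions (2 FLOPs per multiply-add, softmax/bias/LayerNorm ignored)
# and the sequence length are assumptions for illustration only.

def mha_flops(seq_len: int, hidden: int) -> int:
    # Q/K/V/output projections: 4 matmuls of (seq_len x hidden) x (hidden x hidden).
    proj = 4 * 2 * seq_len * hidden * hidden
    # Attention scores (QK^T) and the weighted sum over values.
    attn = 2 * 2 * seq_len * seq_len * hidden
    return proj + attn

def ffn_flops(seq_len: int, hidden: int, inner: int) -> int:
    # Two linear layers: hidden -> inner -> hidden.
    return 2 * 2 * seq_len * hidden * inner

if __name__ == "__main__":
    seq, d, d_ff = 128, 768, 3072
    mha, ffn = mha_flops(seq, d), ffn_flops(seq, d, d_ff)
    print(f"MHA: {mha / 1e9:.2f} GFLOPs, FFN: {ffn / 1e9:.2f} GFLOPs, "
          f"ratio: {ffn / mha:.2f}x")
    # For short sequences the FFN dominates and the ratio approaches 2x; the
    # paper's 2~3x figure may reflect a different counting convention or
    # measured latency rather than this simplified FLOP model.
```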