Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, their bulky size and slow inference speed make them hard to deploy on edge devices. In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT yields a higher gain than improving the multi-head attention (MHA), since the computational cost of the FFN is 2$\sim$3 times larger than that of MHA. Hence, to compact BERT, we focus on designing an efficient FFN, in contrast to previous works that concentrate on MHA. Since the FFN comprises a multilayer perceptron (MLP) that is essential to BERT optimization, we further design a thorough search space over advanced MLP structures and apply a coarse-to-fine mechanism to search for an efficient BERT architecture. Moreover, to accelerate the search and enhance model transferability, we employ a novel warm-up knowledge distillation strategy at each search stage. Extensive experiments show that our searched EfficientBERT is 6.9$\times$ smaller and 4.4$\times$ faster than BERT$\rm_{BASE}$, and achieves competitive performance on the GLUE and SQuAD benchmarks. Concretely, EfficientBERT attains a 77.7 average score on the GLUE \emph{test} set, 0.7 higher than MobileBERT$\rm_{TINY}$, and achieves an 85.3/74.5 F1 score on the SQuAD v1.1/v2.0 \emph{dev} sets, 3.2/2.7 higher than TinyBERT$_4$ even without data augmentation. The code is released at https://github.com/cheneydon/efficient-bert.
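As a rough sanity check of the FFN-versus-MHA cost comparison stated above, the following sketch (not from the paper; the FLOP-counting convention and the sequence length of 128 are assumptions) estimates per-layer FLOPs for a BERT$\rm_{BASE}$-sized layer:

```python
# Illustrative sketch: per-layer FLOP estimates for the MHA and FFN sub-layers
# of a BERT_BASE-sized Transformer layer (hidden size 768, FFN inner size 3072).
# Counting conventions (2 FLOPs per multiply-add, softmax/bias/LayerNorm ignored)
# and the sequence length are assumptions for illustration only.

def mha_flops(seq_len: int, hidden: int) -> int:
    # Q/K/V/output projections: 4 matmuls of (seq_len x hidden) x (hidden x hidden).
    proj = 4 * 2 * seq_len * hidden * hidden
    # Attention scores (QK^T) and the weighted sum over values.
    attn = 2 * 2 * seq_len * seq_len * hidden
    return proj + attn

def ffn_flops(seq_len: int, hidden: int, inner: int) -> int:
    # Two linear layers: hidden -> inner -> hidden.
    return 2 * 2 * seq_len * hidden * inner

if __name__ == "__main__":
    seq, d, d_ff = 128, 768, 3072
    mha, ffn = mha_flops(seq, d), ffn_flops(seq, d, d_ff)
    print(f"MHA: {mha / 1e9:.2f} GFLOPs, FFN: {ffn / 1e9:.2f} GFLOPs, "
          f"ratio: {ffn / mha:.2f}x")
    # For short sequences the FFN dominates and the ratio approaches 2x; the
    # paper's 2~3x figure may reflect a different counting convention or
    # measured latency rather than this simplified FLOP model.
```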