Dynamic early exiting has been proven to improve the inference speed of pre-trained language models like BERT. However, all samples must pass through every consecutive layer before exiting, and more complex samples usually traverse more layers, which still leaves redundant computation. In this paper, we propose SmartBERT, a novel dynamic early exiting mechanism combined with layer skipping for BERT inference, which adds a skipping gate and an exiting operator to each layer of BERT. SmartBERT can adaptively skip some layers and adaptively choose whether to exit. Besides, we propose cross-layer contrastive learning and incorporate it into our training phases to strengthen the intermediate layers and classifiers, which is beneficial for early exiting. To keep the usage of skipping gates consistent between the training and inference phases, we propose a hard weight mechanism during the training phase. We conduct experiments on eight classification datasets of the GLUE benchmark. Experimental results show that SmartBERT achieves a 2-3x computation reduction with minimal accuracy drops compared with BERT, and our method outperforms previous methods in both efficiency and accuracy. Moreover, on some complex datasets like RTE and WNLI, we show that entropy-based early exiting hardly works, and the skipping mechanism is essential for reducing computation.
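To make the abstract's mechanism concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how a per-layer skipping gate and an entropy-based exiting classifier could interact at inference time; the module names, gate threshold, and entropy threshold are all hypothetical assumptions for illustration only.

```python
# Minimal sketch, assuming a per-layer skip gate on the [CLS] token and an
# entropy-based exit criterion; all thresholds and module names are hypothetical.
import torch
import torch.nn as nn


class SketchSmartLayer(nn.Module):
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=4, batch_first=True)
        self.skip_gate = nn.Linear(hidden_size, 1)        # decides skip vs. compute
        self.exit_classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden):
        cls = hidden[:, 0]                                 # [CLS] representation
        skip = torch.sigmoid(self.skip_gate(cls)) > 0.5    # hard gate at inference
        if not skip.item():                                # batch size 1 assumed
            hidden = self.encoder(hidden)                  # compute the layer
        logits = self.exit_classifier(hidden[:, 0])        # per-layer exit head
        return hidden, logits


def infer_with_early_exit(layers, hidden, entropy_threshold=0.3):
    """Run layers sequentially; stop once the exit classifier is confident."""
    logits = None
    for layer in layers:
        hidden, logits = layer(hidden)
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
        if entropy.item() < entropy_threshold:             # low entropy -> exit early
            break
    return logits


if __name__ == "__main__":
    torch.manual_seed(0)
    layers = nn.ModuleList(SketchSmartLayer(64, 2) for _ in range(12))
    hidden = torch.randn(1, 16, 64)                        # (batch, seq_len, hidden)
    print(infer_with_early_exit(layers, hidden))
```

In this sketch, skipping avoids the cost of a layer entirely, while early exiting halts the whole forward pass; the paper's contribution is combining the two so that even samples that never become confident enough to exit can still save computation by skipping layers.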