BERT has achieved superior performance on Natural Language Understanding (NLU) tasks. However, BERT has a large number of parameters and requires considerable resources to deploy. For acceleration, Dynamic Early Exiting for BERT (DeeBERT) was recently proposed; it incorporates multiple exits and adopts a dynamic early-exit mechanism for efficient inference. While this offers an efficiency-performance tradeoff, the early exits in multi-exit BERT perform significantly worse than the late exits. In this paper, we leverage gradient regularized self-distillation for RObust training of Multi-Exit BERT (RomeBERT), which effectively solves the performance imbalance between early and late exits. Moreover, RomeBERT adopts a one-stage joint training strategy for the multiple exits and the BERT backbone, whereas DeeBERT requires two training stages and hence more training time. Extensive experiments on GLUE datasets demonstrate the superiority of our approach. Our code is available at https://github.com/romebert/RomeBERT.
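To make the ingredients above concrete, the following PyTorch sketch illustrates the general pattern of a multi-exit BERT trained in a single stage with per-exit supervision plus self-distillation from the deepest exit, and a DeeBERT-style entropy-threshold rule at inference. This is a minimal illustration under our own assumptions, not the released RomeBERT implementation: the class name, head design, loss weighting, and threshold are illustrative, and the gradient regularization term of the paper is omitted.

```python
# Minimal sketch (not the released implementation): multi-exit BERT with
# one-stage joint training (cross-entropy at every exit + self-distillation
# from the final exit) and entropy-based early exiting at inference.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertConfig


class MultiExitBert(nn.Module):
    def __init__(self, num_labels=2, temperature=2.0):
        super().__init__()
        self.config = BertConfig()          # 12-layer BERT-base by default
        self.bert = BertModel(self.config)
        self.temperature = temperature
        # One classification head ("exit") after every transformer layer.
        self.exits = nn.ModuleList(
            [nn.Linear(self.config.hidden_size, num_labels)
             for _ in range(self.config.num_hidden_layers)]
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        output_hidden_states=True)
        # hidden_states[0] is the embedding output; layers are 1..num_layers.
        cls_states = [h[:, 0] for h in out.hidden_states[1:]]
        return [head(h) for head, h in zip(self.exits, cls_states)]

    def joint_loss(self, logits_per_exit, labels, alpha=0.5):
        # One-stage objective: supervised CE at every exit plus KL
        # self-distillation, with the deepest exit as a detached teacher.
        T = self.temperature
        teacher = logits_per_exit[-1].detach()
        loss = 0.0
        for logits in logits_per_exit:
            ce = F.cross_entropy(logits, labels)
            kd = F.kl_div(F.log_softmax(logits / T, dim=-1),
                          F.softmax(teacher / T, dim=-1),
                          reduction="batchmean") * (T * T)
            loss = loss + (1 - alpha) * ce + alpha * kd
        return loss / len(logits_per_exit)

    @torch.no_grad()
    def infer_early_exit(self, input_ids, attention_mask, entropy_threshold=0.3):
        # DeeBERT-style rule: take the first exit whose prediction entropy
        # falls below a threshold. Sketched on a full forward pass for
        # brevity; a real implementation would skip the remaining layers.
        for i, logits in enumerate(self(input_ids, attention_mask)):
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean()
            if entropy < entropy_threshold or i == len(self.exits) - 1:
                return i, logits


# Tiny usage example with random inputs (shapes only, no real data).
model = MultiExitBert(num_labels=2)
ids = torch.randint(0, model.config.vocab_size, (4, 16))
mask = torch.ones_like(ids)
loss = model.joint_loss(model(ids, mask), labels=torch.randint(0, 2, (4,)))
```

Detaching the teacher logits is one common design choice in self-distillation: it keeps the final exit trained purely on the supervised signal so it is not pulled toward its weaker students, while the early exits receive both hard labels and the teacher's softened distribution.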