Pre-trained language models like BERT and RoBERTa, though powerful on many natural language processing tasks, are both computationally and memory expensive. To alleviate this problem, one approach is to compress them for specific tasks before deployment. However, recent works on BERT compression usually reduce the large BERT model to a fixed smaller size, and thus cannot fully satisfy the requirements of different edge devices with varying hardware performance. In this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT), which can run at adaptive width and depth. The training process of DynaBERT consists of first training a width-adaptive BERT and then allowing both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used so that the more important attention heads and neurons are shared by more sub-networks. Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size has performance comparable to BERT (or RoBERTa), while at smaller widths and depths it consistently outperforms existing BERT compression methods.
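To make the width-adaptive idea concrete, below is a minimal sketch (not the authors' code) of how a feed-forward block could expose sub-networks of different widths: at a width multiplier m_w, only the leading fraction of intermediate neurons is used, so the neurons that network rewiring moves to the front are shared by every sub-network. The names `AdaptiveFFN` and `set_width_mult`, and the specific slicing scheme, are illustrative assumptions rather than the paper's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFFN(nn.Module):
    """A feed-forward block whose effective width can be shrunk at run time."""

    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)
        self.fc2 = nn.Linear(intermediate_size, hidden_size)
        self.width_mult = 1.0  # fraction of intermediate neurons currently active

    def set_width_mult(self, m_w):
        self.width_mult = m_w

    def forward(self, x):
        # Keep only the leading `n_active` intermediate neurons; after rewiring
        # these are assumed to be the most important ones, shared by all widths.
        n_active = int(self.fc1.out_features * self.width_mult)
        h = torch.relu(
            F.linear(x, self.fc1.weight[:n_active], self.fc1.bias[:n_active])
        )
        # The second projection consumes only the active neurons, so the
        # output dimension stays fixed regardless of the chosen width.
        return F.linear(h, self.fc2.weight[:, :n_active], self.fc2.bias)


if __name__ == "__main__":
    ffn = AdaptiveFFN()
    x = torch.randn(2, 16, 768)
    for m_w in (1.0, 0.75, 0.5, 0.25):
        ffn.set_width_mult(m_w)
        print(m_w, ffn(x).shape)  # output shape stays (2, 16, 768) at every width
```

Depth adaptivity can be sketched analogously by skipping a subset of transformer layers; the point of the example is only that one set of shared parameters serves all sub-networks, which is what allows distillation from the full-sized model into each of them.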