Pre-trained language models like BERT, though powerful in many natural language processing tasks, are expensive in both computation and memory. To alleviate this problem, one approach is to compress them for specific tasks before deployment. However, recent works on BERT compression usually compress the large BERT model to a fixed smaller size, and thus cannot fully satisfy the requirements of different edge devices with varying hardware capabilities. In this paper, we propose a novel dynamic BERT model (abbreviated as DynaBERT), which can flexibly adjust its size and latency by selecting an adaptive width and depth. The training process of DynaBERT first trains a width-adaptive BERT and then allows both adaptive width and depth, by distilling knowledge from the full-sized model to small sub-networks. Network rewiring is also used so that the more important attention heads and neurons are shared by more sub-networks. Comprehensive experiments under various efficiency constraints demonstrate that our proposed dynamic BERT (or RoBERTa) at its largest size achieves performance comparable to BERT-base (or RoBERTa-base), while at smaller widths and depths it consistently outperforms existing BERT compression methods. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT.
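Below is a minimal, hedged sketch of the training idea described above: sub-networks of different widths and depths are trained to mimic the full-sized model. It uses a toy feed-forward stack in place of BERT, varies width by masking trailing hidden units (standing in for removing attention heads and FFN neurons) and depth by keeping only a prefix of layers, and distills with a simple MSE on logits. The class and function names, the multiplier values, and the loss choice are illustrative assumptions, not the authors' implementation; network rewiring (reordering heads/neurons by importance before slimming) is omitted here.

```python
# Toy illustration of DynaBERT-style adaptive-width/depth training with
# knowledge distillation. Not the authors' code; names and hyperparameters
# are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """A stand-in for BERT: a stack of small feed-forward 'layers'."""

    def __init__(self, hidden=128, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in range(num_layers)])
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, x, width_mult=1.0, depth_mult=1.0):
        # Depth adaptation: keep only the first depth_mult fraction of layers.
        kept_layers = max(1, int(round(len(self.layers) * depth_mult)))
        for layer in self.layers[:kept_layers]:
            h = layer(x)
            # Width adaptation: zero out the trailing (1 - width_mult) share of
            # hidden units, mimicking the removal of heads/neurons.
            kept = max(1, int(round(h.size(-1) * width_mult)))
            mask = torch.zeros_like(h)
            mask[..., :kept] = 1.0
            x = F.relu(h * mask)
        return self.classifier(x)


def distill_step(teacher, student, batch, optimizer,
                 width_mults=(0.25, 0.5, 0.75, 1.0),
                 depth_mults=(0.5, 1.0)):
    """One training step: each (width, depth) sub-network mimics the teacher."""
    x, _ = batch
    with torch.no_grad():
        teacher_logits = teacher(x)  # full-sized model as the teacher
    optimizer.zero_grad()
    loss = 0.0
    for w in width_mults:
        for d in depth_mults:
            student_logits = student(x, width_mult=w, depth_mult=d)
            loss = loss + F.mse_loss(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    teacher = ToyEncoder()
    student = ToyEncoder()
    student.load_state_dict(teacher.state_dict())  # initialize from teacher
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    batch = (torch.randn(8, 128), torch.zeros(8, dtype=torch.long))
    print("distillation loss:", distill_step(teacher, student, batch, opt))
```

At deployment time, a single trained model of this kind can serve different efficiency constraints by simply picking a (width_mult, depth_mult) pair, which is the flexibility the abstract refers to.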