Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method, a knowledge distillation (KD) method specially designed for Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture both the general-domain and the task-specific knowledge of the teacher BERT. TinyBERT is empirically effective and achieves results comparable to BERT on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT is also significantly better than state-of-the-art baselines, with only about 28% of their parameters and 31% of their inference time.
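To make the idea of Transformer distillation concrete, the following is a minimal PyTorch sketch of the kind of layer-matching objective the abstract describes: the student imitates the teacher's hidden states and attention maps, with an optional soft-label term for the task-specific stage. The layer pairing, the learnable projection, and the loss weighting shown here are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerDistillLoss(nn.Module):
    """Sketch of a Transformer-layer distillation objective (assumed form)."""

    def __init__(self, student_dim: int, teacher_dim: int, temperature: float = 1.0):
        super().__init__()
        # Learnable projection so the (narrower) student hidden states can be
        # compared against the teacher's hidden states.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.temperature = temperature

    def forward(self, student_hidden, teacher_hidden, student_attn, teacher_attn,
                student_logits=None, teacher_logits=None):
        # student_hidden / teacher_hidden: lists of [batch, seq, dim] tensors,
        # already paired layer-by-layer by some layer-mapping strategy (assumed).
        hidden_loss = sum(
            F.mse_loss(self.proj(s), t)
            for s, t in zip(student_hidden, teacher_hidden)
        )
        # Attention maps: lists of [batch, heads, seq, seq] tensors.
        attn_loss = sum(
            F.mse_loss(s, t) for s, t in zip(student_attn, teacher_attn)
        )
        loss = hidden_loss + attn_loss

        # Optional soft-label distillation on the prediction layer,
        # e.g. during the task-specific learning stage.
        if student_logits is not None and teacher_logits is not None:
            t = self.temperature
            loss = loss + F.kl_div(
                F.log_softmax(student_logits / t, dim=-1),
                F.softmax(teacher_logits / t, dim=-1),
                reduction="batchmean",
            ) * (t * t)
        return loss
```

In this sketch the pre-training stage would use only the hidden-state and attention terms, while the task-specific stage would also pass teacher and student logits to enable the soft-label term; this mirrors the two-stage framework described above.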