The development of over-parameterized pre-trained language models has made a significant contribution to the success of natural language processing. While over-parameterization of these models is the key to their generalization power, it makes them unsuitable for deployment on low-capacity devices. We push the limits of state-of-the-art Transformer-based pre-trained language model compression using Kronecker decomposition. We use this decomposition to compress the embedding layer, all linear mappings in the multi-head attention, and the feed-forward network modules in the Transformer layers. We perform intermediate-layer knowledge distillation using the uncompressed model as the teacher to improve the performance of the compressed model. We present KroneckerBERT, a compressed version of the BERT_BASE model obtained using this framework. We evaluate the performance of KroneckerBERT on well-known NLP benchmarks and show that for a high compression factor of 19 (5% of the size of the BERT_BASE model), KroneckerBERT outperforms state-of-the-art compression methods on the GLUE benchmark. Our experiments indicate that the proposed model has promising out-of-distribution robustness and is superior to state-of-the-art compression methods on SQuAD.
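To make the compression scheme concrete, the sketch below illustrates how a Kronecker-factorized linear mapping replaces a dense weight matrix. The factor shapes (64x64 and 12x12, giving a 768x768 mapping) are hypothetical and chosen only for illustration; they are not the exact factorization used in KroneckerBERT. The key points are the parameter saving and the fact that the forward pass can be computed without ever materializing the full weight.

```python
import numpy as np

# Hypothetical shapes for illustration: a 768x768 linear mapping factored as
# the Kronecker product of a 64x64 factor A and a 12x12 factor B.
m1, n1 = 64, 64    # shape of factor A
m2, n2 = 12, 12    # shape of factor B

rng = np.random.default_rng(0)
A = rng.standard_normal((m1, n1))
B = rng.standard_normal((m2, n2))

# The full weight W = A kron B has shape (m1*m2, n1*n2) = (768, 768),
# but only m1*n1 + m2*n2 parameters need to be stored.
full_params = (m1 * m2) * (n1 * n2)
kron_params = m1 * n1 + m2 * n2
print(f"parameter reduction: {full_params / kron_params:.0f}x")

x = rng.standard_normal(n1 * n2)

# Naive forward pass: materialize W explicitly.
y_naive = np.kron(A, B) @ x

# Efficient forward pass: (A kron B) x == vec(A X B^T), where X is x
# reshaped row-major to (n1, n2); W is never formed.
y_fast = (A @ x.reshape(n1, n2) @ B.T).reshape(-1)

assert np.allclose(y_naive, y_fast)
```

In practice the factors are learned (e.g., initialized from the uncompressed weights and refined with knowledge distillation, as described in the abstract), and the reshaping trick above is what keeps inference cost low despite the large implicit weight matrix.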