Recently, transformer-based language models such as BERT have shown tremendous performance improvements on a range of natural language processing tasks. However, these language models are usually computationally expensive and memory intensive during inference, making them difficult to deploy on resource-constrained devices. To improve inference performance and reduce model size while maintaining accuracy, we propose a novel quantization method named KDLSQ-BERT that combines knowledge distillation (KD) with learned step size quantization (LSQ) for language model quantization. The main idea of our method is to leverage KD to transfer knowledge from a "teacher" model to a "student" model while LSQ is used to quantize that "student" model during quantization training. Extensive experimental results on the GLUE benchmark and SQuAD demonstrate that our proposed KDLSQ-BERT not only performs effectively across different bit widths (e.g. 2-bit $\sim$ 8-bit quantization), but also outperforms existing BERT quantization methods, and even achieves accuracy comparable to the full-precision baseline model while obtaining a 14.9x compression ratio. Our code will be publicly available.
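To make the two ingredients concrete, the sketch below illustrates, under stated assumptions, how a learned-step-size quantizer and a distillation loss can be combined in training code. This is a minimal PyTorch-style illustration, not the paper's implementation: the class `LSQQuantizer`, the function `kd_quantization_loss`, and the choice of a soft cross-entropy on logits are all illustrative assumptions; the actual KDLSQ-BERT training may use additional distillation terms and a different quantizer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LSQQuantizer(nn.Module):
    """Minimal learned step size quantization (LSQ) sketch.

    The step size `step` is a trainable parameter; round() and clamp()
    use a straight-through estimator so gradients reach both the input
    tensor and the step size.
    """

    def __init__(self, num_bits: int = 8):
        super().__init__()
        self.num_bits = num_bits
        # Symmetric signed integer range, e.g. [-128, 127] for 8 bits.
        self.qn = -(2 ** (num_bits - 1))
        self.qp = 2 ** (num_bits - 1) - 1
        self.step = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gradient scaling for the step size, as suggested in the LSQ paper:
        # g = 1 / sqrt(N * Qp). Forward value stays equal to `step`.
        g = 1.0 / (x.numel() * self.qp) ** 0.5
        s = (self.step - self.step * g).detach() + self.step * g
        # Quantize: clamp to the integer range, round with a straight-through
        # estimator, then rescale back to the real-valued domain.
        x_q = torch.clamp(x / s, self.qn, self.qp)
        x_q = (x_q.round() - x_q).detach() + x_q
        return x_q * s


def kd_quantization_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """Soft cross-entropy between the full-precision teacher's predictions
    and the quantized student's predictions (one possible KD objective)."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In a training loop, the student's weights and activations would pass through `LSQQuantizer` instances while `kd_quantization_loss` (possibly combined with the task loss and intermediate-layer distillation terms) drives the gradient updates of both the model parameters and the learned step sizes.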