Pre-trained Transformer models such as BERT have shown great success in a wide range of applications, but at the cost of substantial increases in model complexity. Quantization-aware training (QAT) is a promising method to lower the implementation cost and energy consumption. However, aggressive quantization below 2 bits causes considerable accuracy degradation due to unstable convergence, especially when the downstream dataset is not abundant. This work proposes a proactive knowledge distillation method called Teacher Intervention (TI) for fast-converging QAT of ultra-low-precision pre-trained Transformers. TI intervenes in layer-wise signal propagation with the intact signal from the teacher to remove the interference of propagated quantization errors, smoothing the loss surface of QAT and expediting convergence. Furthermore, we propose a gradual intervention mechanism to stabilize the recovery of subsections of Transformer layers from quantization. The proposed schemes enable fast convergence of QAT and improve model accuracy regardless of the diverse characteristics of downstream fine-tuning tasks. We demonstrate that TI consistently achieves superior accuracy with significantly fewer fine-tuning iterations than state-of-the-art QAT methods on well-known Transformers for natural language processing as well as computer vision.
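To make the core idea concrete, below is a minimal, hypothetical PyTorch-style sketch of layer-wise teacher intervention during QAT: each quantized student layer is fed the teacher's full-precision hidden state instead of the student's own quantization-corrupted one, so propagated quantization error is removed, and each layer output is matched to the teacher's via an MSE distillation loss. The function name and layer interfaces (`teacher_intervention_loss`, `student_layers`, `teacher_layers`) are illustrative assumptions rather than the authors' implementation, and the paper's gradual intervention over Transformer subsections is omitted for brevity.

```python
import torch
import torch.nn.functional as F


def teacher_intervention_loss(student_layers, teacher_layers, embeddings):
    """Sketch of layer-wise teacher intervention for QAT.

    Each (quantized) student layer receives the teacher's intact hidden
    state as input, so quantization errors do not accumulate across layers.
    Assumes each layer maps a hidden-state tensor to a hidden-state tensor.
    """
    with torch.no_grad():
        # Propagate the full-precision teacher to collect intact hidden states.
        teacher_hidden = [embeddings]
        h = embeddings
        for t_layer in teacher_layers:
            h = t_layer(h)
            teacher_hidden.append(h)

    ti_loss = 0.0
    for idx, s_layer in enumerate(student_layers):
        # Intervention: feed the teacher's hidden state, not the student's
        # quantization-corrupted output from the previous layer.
        s_out = s_layer(teacher_hidden[idx])
        # Distill: match the student layer output to the teacher's output.
        ti_loss = ti_loss + F.mse_loss(s_out, teacher_hidden[idx + 1])
    return ti_loss
```

In this sketch, isolating each quantized layer from upstream quantization noise is what smooths the QAT loss surface and speeds convergence; a full training loop would combine this loss with the usual task or logit-distillation objective.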