Knowledge distillation (KD) has been a ubiquitous method for model compression, strengthening the capability of a lightweight model with knowledge transferred from a teacher. In particular, KD has been employed in quantization-aware training (QAT) of Transformer encoders like BERT to improve the accuracy of the student model with reduced-precision weight parameters. However, little is understood about which of the various KD approaches best fits the QAT of Transformers. In this work, we provide an in-depth analysis of the mechanism of KD on attention recovery of quantized large Transformers. In particular, we reveal that the previously adopted MSE loss on the attention score is insufficient for recovering the self-attention information. Therefore, we propose two KD methods: attention-map and attention-output losses. Furthermore, we explore the unification of both losses to address the task-dependent preference between attention-map and attention-output losses. The experimental results on various Transformer encoder models demonstrate that the proposed KD methods achieve state-of-the-art accuracy for QAT with sub-2-bit weight quantization.
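To make the two proposed distillation targets concrete, the following is a minimal sketch of how an attention-map loss (matching the softmax-normalized attention distributions of teacher and student) and an attention-output loss (matching the outputs of the self-attention sublayer) might be computed in PyTorch. The function names, tensor shapes, and choice of KL divergence versus MSE are illustrative assumptions based on the abstract's description, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def attention_map_loss(student_scores: torch.Tensor,
                       teacher_scores: torch.Tensor) -> torch.Tensor:
    """Distill the attention *map*: KL divergence between the
    softmax-normalized attention distributions of student and teacher.

    Both inputs are raw attention scores of shape
    (batch, num_heads, seq_len, seq_len).  This is a sketch; the
    paper's exact formulation (e.g., layer/head averaging) may differ.
    """
    s_log_prob = F.log_softmax(student_scores, dim=-1)
    t_prob = F.softmax(teacher_scores, dim=-1)
    return F.kl_div(s_log_prob, t_prob, reduction="batchmean")

def attention_output_loss(student_out: torch.Tensor,
                          teacher_out: torch.Tensor) -> torch.Tensor:
    """Distill the attention *output*: MSE between the outputs of the
    self-attention sublayer, shape (batch, seq_len, hidden_size)."""
    return F.mse_loss(student_out, teacher_out)

# Hypothetical usage: combine both losses with a weighting factor alpha
# to reflect the task-dependent preference discussed in the abstract.
# alpha = 0.5
# loss = alpha * attention_map_loss(s_scores, t_scores) \
#        + (1 - alpha) * attention_output_loss(s_out, t_out)
```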