Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models. Recent years have seen a surge of research aiming to improve KD by leveraging contrastive learning, intermediate layer distillation, data augmentation, and adversarial training. In this work, we propose a learning-based data augmentation technique tailored for knowledge distillation, called CILDA. To the best of our knowledge, this is the first time that intermediate layer representations of the main task are used to improve the quality of augmented samples. More precisely, we introduce an augmentation technique for KD based on intermediate layer matching with a contrastive loss to improve masked adversarial data augmentation. CILDA outperforms existing state-of-the-art KD approaches on the GLUE benchmark, as well as in an out-of-domain evaluation.
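The abstract does not spell out the exact objective, but one plausible reading of "intermediate layer matching using contrastive loss" is an InfoNCE-style loss that pulls a student's pooled intermediate-layer representation toward the teacher's representation of the same example, while pushing it away from other examples in the batch. The sketch below is a minimal, hypothetical illustration under that assumption; the function name, temperature value, and layer choices are illustrative and not CILDA's actual implementation.

```python
# Hypothetical sketch: InfoNCE-style contrastive matching of intermediate-layer
# representations between teacher and student. This is one possible reading of
# the abstract, not the paper's exact objective.
import torch
import torch.nn.functional as F


def intermediate_contrastive_loss(student_hidden, teacher_hidden, temperature=0.1):
    """Contrast pooled intermediate representations of student vs. teacher.

    student_hidden, teacher_hidden: (batch, hidden) tensors pooled from a chosen
    intermediate layer. The matching (student_i, teacher_i) pair is the positive;
    all other teacher representations in the batch act as in-batch negatives.
    """
    s = F.normalize(student_hidden, dim=-1)
    t = F.normalize(teacher_hidden, dim=-1)
    logits = s @ t.t() / temperature            # (batch, batch) cosine similarities
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)     # diagonal entries are the positives


if __name__ == "__main__":
    torch.manual_seed(0)
    student_h = torch.randn(8, 768)  # e.g. pooled states from a student layer
    teacher_h = torch.randn(8, 768)  # e.g. pooled states from a teacher layer
    print(intermediate_contrastive_loss(student_h, teacher_h).item())
```

In an augmentation setting, such a loss could be evaluated on the adversarially masked samples as well, so that the generated data is scored not only by the task loss but also by how well it aligns teacher and student intermediate representations.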