Knowledge distillation is one of the primary methods of transferring knowledge from large to small models. However, it requires massive task-specific data, which may not be feasible to obtain in many real-world applications. Data augmentation methods such as representation interpolation, token replacement, or augmentation with models are applied to tackle this problem. However, these data augmentation methods either potentially cause shifts in decision boundaries (representation interpolation), are not expressive enough (token replacement), or introduce too much computational overhead (augmentation with models). To this end, we propose AugPro (Augmentation with Projection), an effective and efficient data augmentation method for distillation. Our method builds on top of representation interpolation augmentation methods to maintain the diversity of expressions and converts the augmented data to tokens to avoid shifting decision boundaries. It uses simple operations that come with little computational overhead. The results on multiple GLUE tasks show that our method can improve distillation performance by a large margin at a low time cost.
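To make the "interpolate, then project back to tokens" idea concrete, below is a minimal sketch in PyTorch. It is not the paper's implementation: the function name `augment_with_projection`, the toy vocabulary, and the nearest-neighbour projection over a single embedding matrix are illustrative assumptions; the actual AugPro procedure and its interpolation scheme differ in detail.

```python
import torch

def augment_with_projection(emb_a, emb_b, embedding_matrix, lam=0.5):
    """Toy sketch: interpolate two token-embedding sequences (mixup-style),
    then project each interpolated vector to its nearest vocabulary token
    so the augmented example stays on the discrete token manifold."""
    # Representation interpolation (mixup-style) in embedding space.
    mixed = lam * emb_a + (1.0 - lam) * emb_b       # (seq_len, dim)
    # Projection: nearest-neighbour lookup against the vocabulary embeddings.
    dists = torch.cdist(mixed, embedding_matrix)    # (seq_len, vocab_size)
    return dists.argmin(dim=-1)                     # (seq_len,) token ids

# Usage with a random toy vocabulary (hypothetical sizes).
vocab_size, dim, seq_len = 100, 16, 8
E = torch.randn(vocab_size, dim)                    # toy embedding matrix
ids_a = torch.randint(0, vocab_size, (seq_len,))
ids_b = torch.randint(0, vocab_size, (seq_len,))
aug_ids = augment_with_projection(E[ids_a], E[ids_b], E, lam=0.3)
```

The projection step is what distinguishes this from plain representation interpolation: the augmented example is expressed as tokens again, so the student is never trained on off-manifold inputs that could shift its decision boundary.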