Previous studies have shown that cross-lingual knowledge distillation can significantly improve the performance of pre-trained models on cross-lingual similarity matching tasks. However, the student model must be large for this approach to work well; otherwise, its performance drops sharply, making it impractical to deploy on memory-limited devices. To address this issue, we delve into cross-lingual knowledge distillation and propose a multi-stage distillation framework for constructing a small yet high-performing cross-lingual model. In our framework, contrastive learning, bottleneck, and parameter recurrent strategies are combined to prevent performance from degrading during compression. The experimental results demonstrate that our method compresses the sizes of XLM-R and MiniLM by more than 50\%, while reducing performance by only about 1\%.
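To make the three strategies named above more concrete, the following is a minimal sketch (not the authors' released code) of how they might be combined during distillation: an InfoNCE-style contrastive loss aligning student and teacher sentence embeddings, a low-dimensional bottleneck projection on the student output, and cross-layer parameter sharing ("parameter recurrence") in the student encoder. All module names, dimensions, and hyperparameters here are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch only: dimensions, pooling, and loss details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentStudentEncoder(nn.Module):
    """Student encoder that reuses one Transformer layer several times (parameter sharing)."""

    def __init__(self, hidden=384, n_heads=6, n_steps=6, bottleneck=128):
        super().__init__()
        # A single set of layer weights, applied n_steps times ("parameter recurrent" strategy).
        self.shared_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads, batch_first=True)
        self.n_steps = n_steps
        # Bottleneck projection that compresses the pooled sentence representation.
        self.bottleneck = nn.Linear(hidden, bottleneck)

    def forward(self, token_embeddings):
        h = token_embeddings
        for _ in range(self.n_steps):
            h = self.shared_layer(h)       # recurrent application of the shared layer
        sent = h.mean(dim=1)               # simple mean pooling over tokens
        return self.bottleneck(sent)       # bottlenecked sentence embedding


def contrastive_distillation_loss(student_emb, teacher_emb, temperature=0.05):
    """InfoNCE-style loss: each student embedding should match its own teacher
    embedding against the other teacher embeddings in the batch (in-batch negatives)."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.t() / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    batch, seq_len, hidden, emb_dim = 8, 16, 384, 128
    student = RecurrentStudentEncoder(hidden=hidden, bottleneck=emb_dim)
    token_embeddings = torch.randn(batch, seq_len, hidden)  # stand-in for embedded input tokens
    teacher_emb = torch.randn(batch, emb_dim)               # stand-in for (projected) teacher outputs
    loss = contrastive_distillation_loss(student(token_embeddings), teacher_emb)
    loss.backward()
    print(f"contrastive distillation loss: {loss.item():.4f}")
```

In this sketch, compression comes from two places: the shared Transformer layer keeps the parameter count low regardless of effective depth, and the bottleneck shrinks the embedding that downstream similarity matching consumes; the contrastive objective is what ties the compressed student back to the teacher's cross-lingual embedding space.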