Existing techniques often attempt to transfer knowledge from a powerful machine translation (MT) model to a speech translation (ST) model through elaborate designs, which typically require transcriptions as extra input during training. However, transcriptions are not always available, and how to improve ST performance without transcriptions, i.e., data efficiency, has rarely been studied in the literature. In this paper, we propose Decoupled Non-parametric Knowledge Distillation (DNKD) from the data perspective to improve data efficiency. Our method follows the knowledge distillation paradigm; however, instead of obtaining the teacher distribution from a sophisticated MT model, we construct it from a non-parametric datastore via k-Nearest-Neighbor (kNN) retrieval, which removes the dependence on transcriptions and MT models. We then decouple the classic knowledge distillation loss into target and non-target distillation to enhance the contribution of the knowledge among non-target logits, which constitutes the prominent "dark knowledge". Experiments on the MuST-C corpus show that the proposed method achieves consistent improvements over a strong baseline without requiring any transcription.
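For intuition, the sketch below illustrates the two ideas mentioned in the abstract: a teacher distribution assembled by kNN retrieval over a datastore of hidden-state/target-token pairs, and a distillation loss decoupled into target and non-target terms (a DKD-style decomposition). This is a minimal PyTorch sketch under assumed settings, not the paper's implementation; all names and hyperparameters (`knn_teacher_distribution`, `k`, `temperature`, `alpha`, `beta`) are illustrative.

```python
# Minimal sketch of a kNN-based teacher distribution and a decoupled KD loss.
# All function names and hyperparameters are illustrative assumptions.

import torch
import torch.nn.functional as F


def knn_teacher_distribution(query, datastore_keys, datastore_tokens,
                             vocab_size, k=8, temperature=10.0):
    """Turn the k nearest datastore entries into a soft teacher distribution."""
    # query: (hidden_dim,); datastore_keys: (N, hidden_dim); datastore_tokens: (N,) long
    dists = torch.cdist(query.unsqueeze(0), datastore_keys).squeeze(0)   # (N,)
    knn_dists, knn_idx = torch.topk(dists, k, largest=False)
    # Closer neighbours get larger weight via a softmax over negative distances.
    weights = F.softmax(-knn_dists / temperature, dim=0)                 # (k,)
    teacher = torch.zeros(vocab_size)
    teacher.scatter_add_(0, datastore_tokens[knn_idx], weights)          # aggregate by token
    return teacher                                                       # sums to 1


def decoupled_kd_loss(student_logits, teacher_probs, target_id,
                      alpha=1.0, beta=1.0):
    """Split KD into a target part and a non-target part (DKD-style)."""
    student_probs = F.softmax(student_logits, dim=-1)

    # Target distillation: match the binary (target vs. rest) probabilities.
    s_t, t_t = student_probs[target_id], teacher_probs[target_id]
    s_bin = torch.stack([s_t, 1.0 - s_t]).clamp_min(1e-8)
    t_bin = torch.stack([t_t, 1.0 - t_t]).clamp_min(1e-8)
    target_kd = F.kl_div(s_bin.log(), t_bin, reduction="sum")

    # Non-target distillation: match the distributions renormalised over
    # non-target tokens only, i.e., the "dark knowledge".
    mask = torch.ones_like(teacher_probs, dtype=torch.bool)
    mask[target_id] = False
    s_nt = (student_probs[mask] / (1.0 - s_t)).clamp_min(1e-8)
    t_nt = (teacher_probs[mask] / (1.0 - t_t)).clamp_min(1e-8)
    non_target_kd = F.kl_div(s_nt.log(), t_nt, reduction="sum")

    return alpha * target_kd + beta * non_target_kd
```

Weighting the two terms separately is what lets the non-target component be amplified independently of the (often dominant) target term; the specific weights and the retrieval temperature here are placeholders.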