Knowledge distillation (KD) is a well-known method for compressing neural models. However, despite the popularity and superiority of multilingual neural machine translation (MNMT), work on distilling knowledge from large MNMT models into smaller ones is practically nonexistent. This paper bridges this gap by presenting an empirical investigation of knowledge distillation for compressing MNMT models. We take Indic-to-English translation as a case study and demonstrate that commonly used language-agnostic and language-aware KD approaches yield models that are 4-5x smaller but suffer from performance drops of up to 3.5 BLEU. To mitigate this, we then experiment with design considerations such as shallower versus deeper models, heavy parameter sharing, multi-stage training, and adapters. We observe that deeper compact models tend to be as good as shallower non-compact ones, and that fine-tuning a distilled model on a high-quality subset slightly boosts translation quality. Overall, we conclude that compressing MNMT models via KD is challenging, indicating immense scope for further research.