Compared to conventional bilingual translation systems, massively multilingual machine translation is appealing because a single model can translate into many languages and benefit from knowledge transfer for low-resource languages. On the other hand, massively multilingual models suffer from the curse of multilinguality unless their size is scaled up massively, which in turn increases their training and inference costs. Sparse Mixture-of-Experts models are a way to drastically increase model capacity without a proportional increase in compute. The recently released NLLB-200 is an example of such a model: it covers 202 languages but requires at least four 32GB GPUs just for inference. In this work, we propose a pruning method that allows the removal of up to 80\% of experts with a negligible loss in translation quality, which makes it feasible to run the model on a single 32GB GPU. Further analysis suggests that our pruning metrics make it possible to identify language-specific experts and to prune the experts that are not relevant for a given language pair.
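To make the idea of language-specific expert pruning concrete, the following is a minimal sketch, assuming a pruning metric based on how often the router dispatches tokens of a given language pair to each expert; the abstract does not specify the exact metric, and all names and shapes below are illustrative.

```python
# Minimal sketch: keep the most frequently routed experts per MoE layer for one
# language pair and prune the rest. Assumes router dispatch counts have already
# been collected by running the model on data for that language pair.
import numpy as np

def prune_experts(router_counts: np.ndarray, keep_fraction: float = 0.2) -> np.ndarray:
    """Return the indices of experts to keep in each MoE layer.

    router_counts: array of shape (num_layers, num_experts) with the number of
        tokens of the target language pair routed to each expert.
    keep_fraction: fraction of experts retained (0.2 keeps 20%, i.e. prunes 80%).
    """
    num_layers, num_experts = router_counts.shape
    k = max(1, int(round(keep_fraction * num_experts)))
    # For each layer, keep the k experts with the highest routing counts.
    kept = np.argsort(router_counts, axis=1)[:, -k:]
    return np.sort(kept, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.poisson(lam=50, size=(12, 128))      # e.g. 12 MoE layers, 128 experts
    kept = prune_experts(counts, keep_fraction=0.2)   # prune 80% of experts
    print(kept.shape)                                 # (12, 26)
```

In an actual model, the experts not listed in `kept` would simply be dropped from each MoE layer (and the router renormalized over the survivors), which is what reduces the memory footprint enough to fit on a single GPU.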