Multilingual machine translation has attracted much attention recently because it supports knowledge transfer among languages and has a lower training and deployment cost than maintaining numerous bilingual models. A known challenge of multilingual models is negative language interference. To enhance translation quality, deeper and wider architectures are applied to multilingual modeling for larger model capacity, but this increases the inference cost. Recent studies have pointed out that parameters shared among languages are the cause of interference, while they may also enable positive transfer. Based on these insights, we propose an adaptive and sparse architecture for multilingual modeling, and train the model to learn shared and language-specific parameters to improve positive transfer and mitigate interference. The sparse architecture activates only a sub-network, which preserves inference efficiency, and the adaptive design selects different sub-networks based on the input languages. Our model outperforms strong baselines across multiple benchmarks. On the large-scale OPUS dataset with $100$ languages, we achieve $+2.1$, $+1.3$ and $+6.2$ BLEU improvements on one-to-many, many-to-one and zero-shot tasks respectively, compared to a standard Transformer, without increasing the inference cost.
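To make the idea of language-adaptive sparsity concrete, the following is a minimal PyTorch sketch of one way a sub-network could be selected per input language inside a Transformer feed-forward sub-layer. The class name `LanguageAdaptiveFFN`, the branch layout, and the language-to-branch mapping are illustrative assumptions for exposition, not the paper's released implementation.

```python
# Hypothetical sketch (not the paper's code): a feed-forward sub-layer whose
# parameters are split into several branches; a fixed language-to-branch mapping
# activates exactly one branch per input, so the forward cost matches a single
# dense FFN of the same width while different languages use different parameters.
import torch
import torch.nn as nn


class LanguageAdaptiveFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_branches: int):
        super().__init__()
        # Each branch is a standard two-layer FFN; some branches may be shared
        # across language groups, others kept language-specific (assumed design).
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(num_branches)
        )

    def forward(self, x: torch.Tensor, branch_id: int) -> torch.Tensor:
        # Only the selected branch runs; all other branches are untouched,
        # keeping the activated sub-network sparse at inference time.
        return x + self.branches[branch_id](x)


# Usage: route a batch of sentences from one language group through branch 1.
layer = LanguageAdaptiveFFN(d_model=512, d_ff=2048, num_branches=4)
out = layer(torch.randn(8, 16, 512), branch_id=1)
```

Because only one branch is executed per example, the per-token compute is the same as a dense layer of equal width; capacity grows with the number of branches rather than with inference cost.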