With multilingual machine translation (MMT) models continuing to grow in size and in the number of supported languages, it is natural to reuse and upgrade existing models to save computation as data becomes available in more languages. However, adding new languages requires updating the vocabulary, which complicates the reuse of embeddings. The question of how to reuse existing models while also making architectural changes to provide capacity for both old and new languages has not been closely studied. In this work, we introduce three techniques that help speed up effective learning of the new languages and alleviate catastrophic forgetting despite vocabulary and architecture mismatches. Our results show that by (1) carefully initializing the network, (2) applying learning rate scaling, and (3) performing data up-sampling, it is possible to exceed the performance of a same-sized baseline model with only 30% of the computation, and to recover the performance of a larger model trained from scratch with over 50% less computation. Furthermore, our analysis reveals that the introduced techniques help learn the new translation directions more effectively while alleviating catastrophic forgetting. We hope our work will guide research into more efficient approaches to growing the language coverage of MMT models and ultimately maximize the reuse of existing models.
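As an illustration of the first technique, a minimal sketch of vocabulary-aware embedding initialization under a vocabulary mismatch is shown below. The abstract does not specify the exact initialization scheme, so this is only an assumed reading: rows for tokens shared with the old vocabulary are copied from the existing model, and rows for newly added tokens are freshly initialized. The function name grow_embeddings, the init_std parameter, and the toy vocabularies are hypothetical and introduced here purely for illustration.

```python
import numpy as np

def grow_embeddings(old_emb, old_vocab, new_vocab, init_std=0.02, seed=0):
    """Build an embedding matrix for an extended vocabulary.

    Rows for tokens present in the old vocabulary are copied from the
    existing model; rows for newly added tokens are randomly initialized.
    `old_vocab` / `new_vocab` map token strings to row indices.
    """
    rng = np.random.default_rng(seed)
    dim = old_emb.shape[1]
    new_emb = rng.normal(0.0, init_std, size=(len(new_vocab), dim))
    for tok, new_idx in new_vocab.items():
        old_idx = old_vocab.get(tok)
        if old_idx is not None:
            new_emb[new_idx] = old_emb[old_idx]  # reuse the trained row
    return new_emb

# Toy usage: shared tokens keep their trained vectors, the new token gets a fresh row.
old_vocab = {"<pad>": 0, "hello": 1}
new_vocab = {"<pad>": 0, "hello": 1, "bonjour": 2}
old_emb = np.ones((2, 4))
print(grow_embeddings(old_emb, old_vocab, new_vocab))
```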