Multilingual pretrained models are effective for machine translation and cross-lingual processing because they cover multiple languages in a single model. However, their tokenizers are fixed before pretraining, so it is difficult to change the vocabulary afterwards. When we extend a pretrained model to new languages, we must modify the tokenizer at the same time. In this paper, we add new subwords to the SentencePiece tokenizer in order to apply a multilingual pretrained model to a new language, Inuktitut. In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of the already pretrained languages, and applied the mBART-50 pretrained model to English-Inuktitut translation.
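The central step, adding new subword pieces to an existing SentencePiece model while leaving the pretrained languages' pieces (and therefore their segmentation) untouched, can be sketched as follows. This is a minimal illustration using the protobuf interface of the `sentencepiece` Python package, not the paper's actual procedure; the file names and the Inuktitut subword list are placeholders.

```python
# Minimal sketch: append new subword pieces to an existing SentencePiece model
# without modifying the pieces (or scores) of the already pretrained languages.
# Assumes `pip install sentencepiece protobuf`; file names are placeholders.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the original SentencePiece model (e.g., the one shipped with mBART-50).
m = sp_pb2.ModelProto()
with open("sentencepiece.bpe.model", "rb") as f:
    m.ParseFromString(f.read())

existing = {p.piece for p in m.pieces}

# Illustrative new subwords for the new language; in practice these would come
# from subword segmentation learned on Inuktitut text.
new_subwords = ["\u2581ᐃᓄᒃᑎᑐᑦ", "ᑎᑐᑦ"]

for piece in new_subwords:
    if piece in existing:
        continue  # existing pieces keep their original entries and scores
    new_piece = sp_pb2.ModelProto.SentencePiece()
    new_piece.piece = piece
    new_piece.score = 0.0  # placeholder score for appended pieces
    m.pieces.append(new_piece)

# Save the extended tokenizer; the pretrained model's embedding matrix must
# then be enlarged to the new vocabulary size before fine-tuning.
with open("sentencepiece.bpe.extended.model", "wb") as f:
    f.write(m.SerializeToString())
```

Because new pieces are only appended, sentences in the already pretrained languages are segmented exactly as before, which is the property the abstract relies on.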