Transfer learning based on pretraining language models on large amounts of raw data has become the norm for reaching state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied to unseen languages, that is, languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages benefit greatly from transfer learning and behave similarly to closely related high-resource languages, whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the script in which these languages are written. Transliterating those languages significantly improves the performance of large-scale multilingual language models on downstream tasks.
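To make the transliteration step concrete, the following is a minimal sketch of character-level transliteration applied to text before it is passed to a multilingual model's tokenizer. The mapping table `CYRILLIC_TO_LATIN` and the function `transliterate` are hypothetical names introduced here for illustration; the table is deliberately incomplete and is not the transliteration scheme used in this work.

```python
# Hypothetical, deliberately incomplete Cyrillic-to-Latin table (illustration only;
# a real scheme would be language-specific and cover the full alphabet).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "е": "e", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "р": "r", "с": "s", "т": "t", "у": "u",
}


def transliterate(text, table=CYRILLIC_TO_LATIN):
    """Lowercase the text and map each character through the table.

    Characters missing from the table (digits, spaces, Latin letters,
    unmapped Cyrillic letters) are passed through unchanged.
    """
    return "".join(table.get(ch, ch) for ch in text.lower())


if __name__ == "__main__":
    # The transliterated string would then be tokenized by the multilingual model.
    print(transliterate("Москва"))
```

A one-to-one character table is the simplest possible design; scripts that need context-dependent rules or multi-character outputs would require a richer mapping than this sketch provides.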