Multilingual language models have pushed the state of the art in cross-lingual NLP transfer. Most zero-shot cross-lingual transfer approaches, however, use one and the same massively multilingual transformer (e.g., mBERT or XLM-R) to transfer to all target languages, irrespective of their typological, etymological, and phylogenetic relations to other languages. In particular, readily available data and models of resource-rich sibling languages are often ignored. In this work, we empirically show, in a case study on Faroese -- a low-resource language from a high-resource language family -- that by leveraging phylogenetic information and departing from the 'one-size-fits-all' paradigm, one can improve cross-lingual transfer to low-resource languages. Specifically, we leverage the abundant resources of other Scandinavian languages (i.e., Danish, Norwegian, Swedish, and Icelandic) for the benefit of Faroese. Our evaluation results show that we can substantially improve transfer performance to Faroese by exploiting data and models of closely related high-resource languages. Further, we release a new web corpus of Faroese, Faroese datasets for named entity recognition (NER) and semantic textual similarity (STS), and new language models trained on all Scandinavian languages.