Transformer-based language models such as BERT have outperformed previous models on a large number of English benchmarks, but their evaluation is often limited to English or a small number of well-resourced languages. In this work, we evaluate monolingual, multilingual, and randomly initialized language models from the BERT family on a variety of Uralic languages, including Estonian, Finnish, Hungarian, Erzya, Moksha, Karelian, Livvi, Komi Permyak, Komi Zyrian, Northern Sámi, and Skolt Sámi. When monolingual models are available (currently only for Estonian, Finnish, and Hungarian), they perform better on their native language, but in general they transfer worse than multilingual models or models of genetically unrelated languages that share the same character set. Remarkably, straightforward transfer of high-resource models, even without special effort toward hyperparameter optimization, yields what appear to be state-of-the-art part-of-speech (POS) tagging and named entity recognition (NER) tools for the minority Uralic languages where there is sufficient data for fine-tuning.