Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, whose labeled \emph{and unlabeled} data is too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models' pretraining data and target language varieties.
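To make the two adaptation steps named above concrete, the following is a minimal sketch, not the authors' released code, of how vocabulary augmentation and additional language-specific pretraining could be wired together on top of multilingual BERT using the HuggingFace \texttt{transformers} API as an assumed substrate; the wordpiece list and the target-language corpus are hypothetical placeholders.

\begin{verbatim}
# Sketch: adapt a multilingual encoder to a low-resource language variety.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# (1) Vocabulary augmentation: add target-language wordpieces missing from
# the multilingual vocabulary, then grow the embedding matrix to match.
# `new_wordpieces` is a placeholder; in practice it would come from a
# subword model trained on the target-language corpus.
new_wordpieces = ["##xx", "yy"]
tokenizer.add_tokens(new_wordpieces)
model.resize_token_embeddings(len(tokenizer))

# (2) Additional language-specific pretraining: continue masked-language-model
# training on the (small) unlabeled target-language corpus.
texts = ["a sentence in the target language variety"]  # placeholder corpus
enc = tokenizer(texts, truncation=True, max_length=128)
dataset = [
    {"input_ids": ids, "attention_mask": mask}
    for ids, mask in zip(enc["input_ids"], enc["attention_mask"])
]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
args = TrainingArguments(output_dir="adapted-mbert",
                         num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
\end{verbatim}

Under this sketch, the adapted encoder would then serve as the input representation for a dependency parser trained on the target-language treebank, matching the case study described above.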