Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks in many different languages, but the success of this approach is far from universal. For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data, motivating additional model adaptations to achieve reasonably strong performance. In this work, we study the performance, extensibility, and interaction of two such adaptations for this low-resource setting: vocabulary augmentation and script transliteration. Our evaluations on a set of three tasks in nine diverse low-resource languages yield mixed results, upholding the viability of these approaches while raising new questions around how best to adapt multilingual models to low-resource settings.