As pre-trained language models become more resource-demanding, the inequality between resource-rich languages such as English and resource-scarce languages is worsening. This can be attributed to the fact that the amount of available training data in each language follows a power-law distribution, and most languages belong to the long tail of that distribution. Several research directions attempt to mitigate this problem. For example, cross-lingual transfer learning and multilingual training aim to benefit long-tail languages through knowledge acquired from resource-rich languages. Although successful, existing work has mainly focused on experimenting on as many languages as possible, and targeted in-depth analysis is largely absent. In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT). To make the transfer scenario challenging, we choose Korean as the target language, as it is a language isolate and thus shares almost no typology with English. Results show that XPT not only outperforms, or performs on par with, monolingual models trained with orders of magnitude more data, but is also highly efficient in the transfer process.