We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K'iche', a Mayan language. We compare our model to a monolingual baseline, and show that the multilingual pre-trained approach yields much more consistent segmentation quality across target dataset sizes, including a zero-shot performance of 20.6 F1, and exceeds the monolingual performance in 9/10 experimental settings. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).