Large multilingual models have inspired a new class of word alignment methods, which work well for these models' pretraining languages. However, the languages most in need of automatic alignment are low-resource and thus not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri--Spanish, Guarani--Spanish, Quechua--Spanish, and Shipibo-Konibo--Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approaches remain competitive with each other.
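For concreteness, below is a minimal sketch of how gold-standard alignments such as these are typically used to score an aligner intrinsically, assuming the standard Alignment Error Rate (AER) metric of Och and Ney (2003) computed over sure and possible gold links. The metric choice, the function name, and the example link sets are illustrative assumptions, not details taken from this paper.

```python
# A minimal sketch of intrinsic alignment evaluation with Alignment Error
# Rate (AER), assuming gold annotations that distinguish "sure" links from
# "possible" links. Lower AER is better. All link sets below are
# hypothetical; they do not come from the datasets described in the paper.

def aer(predicted: set[tuple[int, int]],
        sure: set[tuple[int, int]],
        possible: set[tuple[int, int]]) -> float:
    """Alignment Error Rate; `possible` must be a superset of `sure`."""
    a_s = len(predicted & sure)      # predicted links confirmed as sure
    a_p = len(predicted & possible)  # predicted links that are at least possible
    return 1.0 - (a_s + a_p) / (len(predicted) + len(sure))

# Hypothetical (source index, target index) pairs for one sentence pair.
gold_sure = {(0, 0), (1, 2), (2, 1)}
gold_possible = gold_sure | {(3, 3)}
hypothesis = {(0, 0), (1, 2), (3, 3)}

print(f"AER = {aer(hypothesis, gold_sure, gold_possible):.3f}")  # AER = 0.167
```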