Large multilingual models have inspired a new class of word alignment methods, which work well for these models' pretraining languages. However, the languages most in need of automatic alignment are low-resource and thus not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri--Spanish, Guarani--Spanish, Quechua--Spanish, and Shipibo-Konibo--Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approaches remain competitive with each other.
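For concreteness, below is a minimal sketch of how gold-standard alignments such as these are typically used to score an aligner intrinsically, assuming the standard Alignment Error Rate (AER) metric of Och and Ney (2003) computed over sure and possible gold links. The metric choice, the function name, and the example link sets are illustrative assumptions, not details taken from this paper.

```python
# A minimal sketch of intrinsic alignment evaluation with Alignment Error
# Rate (AER), assuming gold annotations that distinguish "sure" links from
# "possible" links. Lower AER is better. All link sets below are
# hypothetical; they do not come from the datasets described in the paper.

def aer(predicted: set[tuple[int, int]],
        sure: set[tuple[int, int]],
        possible: set[tuple[int, int]]) -> float:
    """Alignment Error Rate; `possible` must be a superset of `sure`."""
    a_s = len(predicted & sure)      # predicted links confirmed as sure
    a_p = len(predicted & possible)  # predicted links that are at least possible
    return 1.0 - (a_s + a_p) / (len(predicted) + len(sure))

# Hypothetical (source index, target index) pairs for one sentence pair.
gold_sure = {(0, 0), (1, 2), (2, 1)}
gold_possible = gold_sure | {(3, 3)}
hypothesis = {(0, 0), (1, 2), (3, 3)}

print(f"AER = {aer(hypothesis, gold_sure, gold_possible):.3f}")  # AER = 0.167
```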