The central bottleneck for low-resource NLP is typically regarded as the quantity of available data, overlooking the contribution of data quality. This is particularly evident in the development and evaluation of low-resource systems via downsampling of high-resource language data. In this work we investigate the validity of this approach, focusing our empirical investigation on two well-known NLP tasks: POS tagging and machine translation. We show that downsampling from a high-resource language yields datasets with different properties than genuinely low-resource datasets, which affects model performance on both POS tagging and machine translation. Based on these results, we conclude that naive downsampling of datasets gives a biased view of how well these systems work in a true low-resource scenario.