Researchers have traditionally recruited native speakers to provide annotations for widely used benchmark datasets. But for some languages, recruiting native speakers is difficult, and it would help if learners of those languages could annotate the data. In this paper, we investigate whether language learners can contribute annotations to benchmark datasets. In a carefully controlled annotation experiment, we recruit 36 language learners, provide two types of additional resources (dictionaries and machine-translated sentences), and perform mini-tests to measure their language proficiency. We target three languages, English, Korean, and Indonesian, and four NLP tasks, sentiment analysis, natural language inference, named entity recognition, and machine reading comprehension. We find that language learners, especially those with intermediate or advanced language proficiency, are able to provide fairly accurate labels with the help of additional resources. Moreover, we show that data annotation improves learners' language proficiency in terms of vocabulary and grammar. The implication of our findings is that broadening the annotation task to include language learners can open up the opportunity to build benchmark datasets for languages for which it is difficult to recruit native speakers.