Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality. We use multiple-choice question answering as a testbed and run a randomized trial by assigning crowdworkers to write questions under one of four different data collection protocols. We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty. However, we find that training crowdworkers, and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments, is an effective means of collecting challenging data. Using crowdsourced, instead of expert, judgments to qualify workers and send feedback does not prove to be effective. We observe that the data from the iterative protocol with expert assessments is more challenging by several measures. Notably, the human--model gap on the unanimous agreement portion of this data is, on average, twice as large as the gap for data collected under the baseline protocol.
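As a rough illustration of the reported metric, the sketch below shows one way a human--model accuracy gap could be computed on the unanimous-agreement portion of a dataset. This is a minimal sketch, not the authors' evaluation code; all field names ("annotator_labels", "model_pred", "gold") are hypothetical placeholders for whatever format the collected examples use.

```python
# Minimal sketch (assumed data format) of a human--model gap on the
# unanimous-agreement subset: human accuracy minus model accuracy,
# restricted to examples where all annotators chose the same answer.
from typing import Dict, List


def accuracy(preds: List[str], golds: List[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def human_model_gap(examples: List[Dict]) -> float:
    """Human accuracy minus model accuracy on unanimously-labeled examples."""
    unanimous = [ex for ex in examples if len(set(ex["annotator_labels"])) == 1]
    human_preds = [ex["annotator_labels"][0] for ex in unanimous]
    model_preds = [ex["model_pred"] for ex in unanimous]
    golds = [ex["gold"] for ex in unanimous]
    return accuracy(human_preds, golds) - accuracy(model_preds, golds)


if __name__ == "__main__":
    toy = [
        {"annotator_labels": ["B", "B", "B"], "model_pred": "A", "gold": "B"},
        {"annotator_labels": ["C", "C", "C"], "model_pred": "C", "gold": "C"},
        # Not unanimous, so excluded from the gap computation:
        {"annotator_labels": ["A", "D", "A"], "model_pred": "A", "gold": "A"},
    ]
    print(f"human--model gap on unanimous subset: {human_model_gap(toy):.2f}")
```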