Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.