Numerical-reasoning-based machine reading comprehension is a task that combines reading comprehension with discrete operations such as addition, subtraction, sorting, and counting. The DROP benchmark (Dua et al., 2019) is a recent dataset that has inspired the design of NLP models aimed at solving this task. The current standings of these models on the DROP leaderboard, under the standard metrics, suggest that they have achieved near-human performance. However, does this mean that these models have learned to reason? In this paper, we present a controlled study of some of the top-performing model architectures for the task of numerical reasoning. Our observations suggest that the standard metrics are incapable of measuring progress on such tasks.