In this paper, we revisit the solving bias that arises when evaluating models on current Math Word Problem (MWP) benchmarks. Current solvers suffer from solving bias, which consists of data bias, caused by biased datasets, and learning bias, caused by improper training strategies. Our experiments verify that MWP solvers are easily biased by training datasets that do not cover diverse questions for each problem narrative, so a solver learns only shallow heuristics rather than the deep semantics needed to understand problems. Moreover, an MWP can naturally be solved by multiple equivalent equations, yet current datasets take only one of them as the ground truth, forcing the model to match the labeled equation and ignore the other equivalent ones. To reduce data bias, we first introduce a novel MWP dataset named UnbiasedMWP, constructed by varying the grounded expressions in our collected data and manually annotating them with corresponding new questions. Then, to further mitigate learning bias, we propose a Dynamic Target Selection (DTS) strategy that, during training, dynamically selects a more suitable target expression according to the longest prefix match between the current model output and candidate equivalent equations, which are obtained by applying the commutative law. The results show that UnbiasedMWP has significantly fewer biases than its original data and other datasets, making it a promising benchmark for fairly evaluating solvers' reasoning skills rather than their ability to match nearest neighbors. Solvers trained with DTS also achieve higher accuracy on multiple MWP benchmarks. The source code is available at https://github.com/yangzhch6/UnbiasedMWP.
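The target-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes equations are written as prefix-notation token lists with binary operators, that only `+` and `*` are commutative, and that ties fall back to the labeled equation. The function names (`parse_prefix`, `variants`, `select_target`) are hypothetical.

```python
from itertools import product

OPERATORS = {"+", "-", "*", "/", "^"}   # assumed binary operator set
COMMUTATIVE = {"+", "*"}                # operators whose operands may be swapped


def parse_prefix(tokens, pos=0):
    """Parse prefix-notation tokens into a nested (op, left, right) tree."""
    tok = tokens[pos]
    if tok in OPERATORS:
        left, pos = parse_prefix(tokens, pos + 1)
        right, pos = parse_prefix(tokens, pos)
        return (tok, left, right), pos
    return tok, pos + 1  # leaf: a number slot or constant


def variants(tree):
    """Yield prefix token lists reachable by swapping operands of + and *.

    The first list yielded is always the original ordering.
    """
    if isinstance(tree, str):
        yield [tree]
        return
    op, left, right = tree
    for l, r in product(list(variants(left)), list(variants(right))):
        yield [op] + l + r
        if op in COMMUTATIVE and l != r:
            yield [op] + r + l


def common_prefix_len(a, b):
    """Number of leading tokens on which two token lists agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_target(model_output, label):
    """Pick the equivalent equation sharing the longest prefix with the output."""
    tree, _ = parse_prefix(label)
    candidates = list(variants(tree))
    # max() keeps the first maximum, so ties resolve to the labeled equation.
    return max(candidates, key=lambda c: common_prefix_len(model_output, c))


label = ["*", "+", "n0", "n1", "n2"]    # (n0 + n1) * n2
output = ["*", "n2", "+", "n1"]         # model has started n2 * (n1 + ...
target = select_target(output, label)   # ['*', 'n2', '+', 'n1', 'n0']
```

In this example the labeled equation shares only one token with the partial output, while the selected variant `n2 * (n1 + n0)` shares four, so the training loss would no longer penalize an ordering that is mathematically equivalent to the label.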