While diverse question answering (QA) datasets have been proposed and have contributed significantly to the development of deep learning models for QA tasks, existing datasets fall short in two aspects. First, we lack QA datasets covering complex questions that involve not only answers but also the reasoning processes needed to derive them. As a result, state-of-the-art QA research on numerical reasoning still focuses on simple calculations and does not provide the mathematical expressions or evidence that justify the answers. Second, the QA community has devoted much effort to improving the interpretability of QA models. However, these models still fail to explicitly show the reasoning process, such as the order in which evidence is used and the interactions between different pieces of evidence. To address the above shortcomings, we introduce NOAHQA, a conversational and bilingual QA dataset with questions requiring numerical reasoning over compound mathematical expressions. With NOAHQA, we develop an interpretable reasoning graph as well as an appropriate evaluation metric to measure answer quality. We evaluate state-of-the-art QA models trained on existing QA datasets on NOAHQA and show that the best of them achieves an exact match score of only 55.5, whereas human performance reaches 89.7. We also present a new QA model for generating reasoning graphs; its reasoning graph metric still lags human performance by a large margin, e.g., 28 points.