To explain predicted answers and to evaluate the reasoning abilities of models, several studies have utilized underlying reasoning (UR) tasks in multi-hop question answering (QA) datasets. However, it remains an open question how effective UR tasks are for the QA task when models are trained on both tasks in an end-to-end manner. In this study, we address this question by analyzing the effectiveness of UR tasks (including both sentence-level and entity-level tasks) in three aspects: (1) QA performance, (2) reasoning shortcuts, and (3) robustness. While previous models have not been explicitly trained on an entity-level reasoning prediction task, we build a multi-task model that performs three tasks together: sentence-level supporting facts prediction, entity-level reasoning prediction, and answer prediction. Experimental results on the 2WikiMultiHopQA and HotpotQA-small datasets reveal that (1) UR tasks can improve QA performance. Using four newly created debiased datasets, we demonstrate that (2) UR tasks are helpful in preventing reasoning shortcuts in the multi-hop QA task. However, we find that (3) UR tasks do not contribute to improving the robustness of the model on adversarial questions, such as sub-questions and inverted questions. We encourage future studies to investigate the effectiveness of entity-level reasoning in the form of natural language questions (e.g., sub-question forms).
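For concreteness, the sketch below illustrates one way such a multi-task model could be wired: a shared encoder feeding three heads (sentence-level supporting facts, entity-level reasoning, and answer-span prediction), trained end-to-end with a joint loss. This is a minimal PyTorch sketch, not the implementation used in the study; the encoder interface, the head designs, the gathering of sentence/entity marker positions, and the loss weights w_sf and w_ent are all illustrative assumptions.

import torch
import torch.nn as nn

class MultiTaskReader(nn.Module):
    # Shared encoder + three task heads trained with a joint loss.
    # Head designs and loss weights are assumptions for illustration.
    def __init__(self, encoder: nn.Module, hidden: int = 768,
                 w_sf: float = 1.0, w_ent: float = 1.0):
        super().__init__()
        self.encoder = encoder                 # any token encoder returning (B, L, hidden)
        self.sf_head = nn.Linear(hidden, 2)    # is this sentence a supporting fact?
        self.ent_head = nn.Linear(hidden, 2)   # does this entity lie on the reasoning path?
        self.span_head = nn.Linear(hidden, 2)  # answer start/end logits per token
        self.w_sf, self.w_ent = w_sf, w_ent
        self.ce = nn.CrossEntropyLoss()

    def forward(self, input_ids, sent_positions, ent_positions,
                sf_labels, ent_labels, start_pos, end_pos):
        h = self.encoder(input_ids)                      # (B, L, hidden)
        # For simplicity, marker positions are assumed shared across the batch.
        sf_logits = self.sf_head(h[:, sent_positions])   # (B, n_sent, 2)
        ent_logits = self.ent_head(h[:, ent_positions])  # (B, n_ent, 2)
        start_logits, end_logits = self.span_head(h).unbind(dim=-1)  # each (B, L)
        # Joint loss: answer span + weighted sentence-level and entity-level UR losses.
        loss = (self.ce(start_logits, start_pos)
                + self.ce(end_logits, end_pos)
                + self.w_sf * self.ce(sf_logits.reshape(-1, 2), sf_labels.reshape(-1))
                + self.w_ent * self.ce(ent_logits.reshape(-1, 2), ent_labels.reshape(-1)))
        return loss

Under this formulation, setting w_sf or w_ent to zero recovers an answer-only baseline, which is one way the contribution of each UR task could be ablated.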