To build robust question answering systems, we need the ability to verify whether answers to questions are truly correct, not just "good enough" in the context of imperfect QA datasets. We explore the use of natural language inference (NLI) as a way to achieve this goal, as NLI inherently requires the premise (document context) to contain all necessary information to support the hypothesis (proposed answer to the question). We leverage large pre-trained models and recent prior datasets to construct powerful question converter and decontextualization modules, which can reformulate QA instances as premise-hypothesis pairs with very high reliability. Then, by combining standard NLI datasets with NLI examples automatically derived from QA training data, we can train NLI models to judge the correctness of QA models' proposed answers. We show that our NLI approach can generally improve the confidence estimation of a QA model across different domains, evaluated in a selective QA setting. Careful manual analysis over the predictions of our NLI model shows that it can further identify cases where the QA model produces the right answer for the wrong reason, or where the answer cannot be verified as addressing all aspects of the question.
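As a rough illustration of the pipeline summarized above, the sketch below shows how a QA instance could be reformulated as a premise-hypothesis pair and scored with an NLI model. It is a minimal sketch under several assumptions: the `question_to_statement` and `decontextualize` helpers are naive placeholders standing in for the learned question converter and decontextualization modules, and the off-the-shelf `roberta-large-mnli` checkpoint stands in for the verifier trained on combined NLI and QA-derived data.

```python
# Conceptual sketch: verify a QA model's proposed answer by reformulating the
# (question, answer, context) triple as an NLI premise-hypothesis pair and
# checking entailment. Helper functions are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI_MODEL = "roberta-large-mnli"  # stand-in for the paper's trained verifier
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)
model.eval()

def question_to_statement(question: str, answer: str) -> str:
    """Placeholder question converter: combine the question and proposed answer
    into a declarative hypothesis. The real module is a learned seq2seq model."""
    return f"The answer to '{question.rstrip('?')}' is {answer}."

def decontextualize(sentence: str, document: str) -> str:
    """Placeholder decontextualizer: the real module rewrites the answer-bearing
    sentence so it stands alone (resolving pronouns, adding missing referents).
    Here the sentence is returned unchanged."""
    return sentence

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis, used as a
    correctness/confidence score for the QA model's proposed answer."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    ent_idx = next(i for i, lab in model.config.id2label.items()
                   if lab.lower() == "entailment")
    return probs[ent_idx].item()

if __name__ == "__main__":
    document = "Hamlet is a tragedy written by William Shakespeare around 1600."
    question = "Who wrote Hamlet?"
    answer = "William Shakespeare"
    premise = decontextualize(document, document)
    hypothesis = question_to_statement(question, answer)
    print(entailment_prob(premise, hypothesis))  # high score => answer verified
```

In a selective QA setting, a score like this could be thresholded so that the system abstains on answers the verifier does not judge to be entailed by the context, which is the use case evaluated in the abstract above.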