While research on scientific claim verification has led to the development of powerful systems that appear to approach human performance, these approaches have yet to be tested in a realistic setting against large corpora of scientific literature. Moving to this open-domain evaluation setting, however, poses unique challenges; in particular, it is infeasible to exhaustively annotate all evidence documents. In this work, we present SciFact-Open, a new test collection designed to evaluate the performance of scientific claim verification systems on a corpus of 500K research abstracts. Drawing upon pooling techniques from information retrieval, we collect evidence for scientific claims by pooling and annotating the top predictions of four state-of-the-art scientific claim verification models. We find that systems developed on smaller corpora struggle to generalize to SciFact-Open, exhibiting performance drops of at least 15 F1. In addition, analysis of the evidence in SciFact-Open reveals interesting phenomena likely to appear when claim verification systems are deployed in practice, e.g., cases where the evidence supports only a special case of the claim. Our dataset is available at https://github.com/dwadden/scifact-open.
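To make the pooling step concrete, the sketch below illustrates depth-k pooling as used in information retrieval evaluation: for each claim, the top-k abstracts retrieved by each system are unioned into a single candidate pool, and only that pool is sent for evidence annotation. This is a minimal illustration under assumed inputs; the function name, the pooling depth `k`, and the document identifiers are hypothetical and not taken from the paper.

```python
def pool_top_k(system_rankings, k=50):
    """Depth-k pooling: union of each system's top-k retrieved
    abstracts for a single claim. Only pooled documents are annotated."""
    pool = set()
    for ranking in system_rankings:   # one ranked list of doc ids per system
        pool.update(ranking[:k])      # keep only that system's top-k
    return pool


# Hypothetical rankings from four claim verification systems for one claim.
rankings = [
    ["doc12", "doc7", "doc3"],
    ["doc7", "doc99", "doc12"],
    ["doc5", "doc12", "doc7"],
    ["doc3", "doc42", "doc7"],
]
print(sorted(pool_top_k(rankings, k=2)))  # candidate abstracts sent to annotators
```

Documents ranked low by every system are never annotated, which is what makes pooling tractable on a 500K-abstract corpus while still capturing most of the evidence the systems can surface.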