Models for textual entailment have increasingly been applied in settings like fact-checking, presupposition verification in question answering, and validating that generation models' outputs are faithful to a source. However, these applications differ substantially from the settings in which existing entailment datasets were constructed. We propose WiCE, a new textual entailment dataset centered on verifying claims in text, built on real-world claims and evidence in Wikipedia with fine-grained annotations. We collect sentences in Wikipedia that cite one or more webpages and annotate whether the content on those pages entails those sentences. Negative examples arise naturally, ranging from slight misinterpretations of the source text to minor aspects of the sentence that are not attested in the evidence. Our annotations cover sub-sentence units of the hypothesis, decomposed automatically by GPT-3, each labeled with a subset of evidence sentences from the source document. We show that real claims in our dataset involve challenging verification problems, and we benchmark existing approaches on it. In addition, we show that reducing the complexity of claims by decomposing them with GPT-3 can improve entailment models' performance across multiple domains.
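As a rough illustration of the decomposition step described above, the sketch below prompts an LLM to split a claim into sub-sentence units, each of which could then be verified against evidence independently. The prompt wording, the few-shot example, the model name (gpt-4o-mini as a stand-in for GPT-3), and the output parsing are all illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of LLM-based claim decomposition (assumptions noted above).
# Requires the openai Python package (>=1.0) and OPENAI_API_KEY in the env.
from openai import OpenAI

client = OpenAI()

# Illustrative one-shot prompt; the paper's actual prompt may differ.
FEW_SHOT = """Segment the following sentence into individual facts:

Sentence: He made his acting debut in the film The Moon is the Sun's Dream (1992), and continued to appear in small and supporting roles throughout the 1990s.
Facts:
- He made his acting debut in a film.
- He made his acting debut in The Moon is the Sun's Dream.
- The Moon is the Sun's Dream was released in 1992.
- After his debut, he appeared in small and supporting roles throughout the 1990s.
"""

def decompose_claim(claim: str) -> list[str]:
    """Ask the model to split one claim into sub-sentence units."""
    prompt = f"{FEW_SHOT}\nSentence: {claim}\nFacts:"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model; the paper used GPT-3
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    text = resp.choices[0].message.content or ""
    # Expect each generated sub-claim on its own "- ..." line.
    facts = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("-"):
            facts.append(line.lstrip("-").strip())
    return facts
```

Each returned sub-claim would then be paired with candidate evidence sentences and scored by an entailment model, mirroring the fine-grained annotation scheme the abstract describes.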