Reasoning over natural language is a challenging problem in NLP. In this work, we focus on proof generation: Given a hypothesis and a set of supporting facts, the model generates a proof tree indicating how to derive the hypothesis from supporting facts. Compared to generating the entire proof in one shot, stepwise generation can better exploit the compositionality and generalize to longer proofs but has achieved limited success on real-world data. Existing stepwise methods struggle to generate proof steps that are both logically valid and relevant to the hypothesis. Instead, they tend to hallucinate invalid steps given the hypothesis. In this paper, we present a novel stepwise method, NLProofS (Natural Language Proof Search), which learns to generate relevant steps conditioning on the hypothesis. At the core of our approach, we train an independent verifier to check the validity of the proof steps to prevent hallucination. Instead of generating steps greedily, we search for proofs maximizing a global proof score judged by the verifier. NLProofS achieves state-of-the-art performance on EntailmentBank and RuleTaker. Specifically, it improves the correctness of predicted proofs from 27.7% to 33.3% in the distractor setting of EntailmentBank, demonstrating the effectiveness of NLProofS in generating challenging human-authored proofs.
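To make the search idea concrete, below is a minimal sketch of verifier-guided proof search: a prover proposes candidate steps, a verifier scores each step's validity, and the search expands the partial proof with the highest aggregate score instead of committing greedily to one step. The `prover` and `verifier` callables, the min-over-steps aggregation, and all names here are illustrative assumptions, not the paper's actual interfaces or scoring function.

```python
# A minimal sketch of verifier-guided stepwise proof search, assuming hypothetical
# `prover` and `verifier` callables; this is NOT NLProofS's exact interface or scoring.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PartialProof:
    neg_score: float                          # negated so heapq pops the best proof first
    facts: frozenset = field(compare=False)   # supporting facts plus conclusions derived so far
    steps: tuple = field(compare=False)       # ((premises, conclusion), ...) taken so far


def search_proof(hypothesis, supporting_facts, prover, verifier, max_depth=5, beam=10):
    """Best-first search over partial proofs. Here the global proof score is the
    minimum verifier score over all steps (one simple, illustrative aggregation)."""
    frontier = [PartialProof(-1.0, frozenset(supporting_facts), ())]
    while frontier:
        state = heapq.heappop(frontier)
        if hypothesis in state.facts:         # hypothesis derived: return the proof steps
            return state.steps
        if len(state.steps) >= max_depth:
            continue
        # The prover proposes candidate steps as (premises, conclusion) pairs,
        # conditioned on the known facts and the hypothesis.
        for premises, conclusion in prover(state.facts, hypothesis)[:beam]:
            validity = verifier(premises, conclusion)   # assumed to lie in [0, 1]
            score = min(-state.neg_score, validity)     # global score = weakest step so far
            heapq.heappush(frontier, PartialProof(
                -score,
                state.facts | {conclusion},
                state.steps + ((premises, conclusion),),
            ))
    return None                               # no proof found within the depth limit
```

In the paper itself, the verifier is a trained model, step scores combine prover and verifier judgments, and the search operates over a proof graph; the sketch above only illustrates why maximizing a global proof score differs from generating steps greedily.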