Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, leading to their adoption in high-stakes domains such as healthcare, law, and scientific research. However, their reasoning often contains subtle logical errors masked by fluent language, posing significant risks for critical applications. While existing approaches such as fact-checking, self-consistency methods, and rule-based validation provide partial solutions, they fail to detect complex logical flaws in multi-step reasoning. To overcome these challenges, we present MATP, an evaluation framework for systematically verifying LLM reasoning via Multi-step Automatic Theorem Proving. MATP translates natural language reasoning into First-Order Logic (FOL) and applies automated theorem provers to assess the logical validity of each reasoning step. This approach identifies hidden logical errors and provides fine-grained classifications of reasoning correctness. Evaluations on a benchmark comprising 10,830 reasoning instances generated by 10 LLMs across tasks from PrOntoQA-OOD, ProofWriter, and FOLIO show that MATP surpasses prompting-based baselines by over 42 percentage points in reasoning step verification. It further reveals model-level disparities, with dedicated reasoning models producing more logically coherent outputs than general-purpose models. These results demonstrate MATP's potential to enhance the trustworthiness of LLM-generated reasoning.
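To make the step-level verification idea concrete, the sketch below checks whether one reasoning step is entailed by its premises using an off-the-shelf FOL solver (Z3 via the `z3-solver` Python package). This is only an illustration of the general technique the abstract describes, not the MATP pipeline itself; the predicates, constants, and the refutation-based check are hypothetical choices made for this example.

```python
# Illustrative sketch, NOT the MATP implementation: verify a single reasoning
# step by checking FOL entailment with Z3. Formula names are hypothetical.
from z3 import (Solver, DeclareSort, Const, Function, BoolSort,
                ForAll, Implies, Not, unsat)

Obj = DeclareSort("Obj")
Wumpus = Function("Wumpus", Obj, BoolSort())
Furry = Function("Furry", Obj, BoolSort())
rex = Const("rex", Obj)
x = Const("x", Obj)

# Premises: "All wumpuses are furry" and "Rex is a wumpus".
premises = [ForAll([x], Implies(Wumpus(x), Furry(x))), Wumpus(rex)]
# Candidate reasoning step produced by the model: "Rex is furry".
step_conclusion = Furry(rex)

# Refutation check: the step is valid iff premises AND NOT(conclusion) is unsatisfiable.
s = Solver()
s.add(*premises)
s.add(Not(step_conclusion))
print("step valid" if s.check() == unsat else "step not entailed")
```

In a full framework, the premises and conclusion would come from automatically translating the model's natural-language reasoning into FOL, and each step would be checked in sequence to localize the first invalid inference.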