Reasoning in language models is difficult to evaluate: natural-language traces cannot be verified, symbolic datasets are too small, and most benchmarks conflate surface heuristics with genuine inference. We present FOL-Traces, the first large-scale dataset of programmatically verified reasoning traces, enabling rigorous evaluation of structured logical inference. We also propose two challenging diagnostic tasks, masked operation prediction and step completion, which directly probe syntactic awareness and process fidelity. FOL-Traces thus serves as a scalable testbed for studying how models perform structured logical inference. Systematic experiments with five reasoning LLMs show that the dataset remains challenging: models reach only about 45.7% accuracy on masked operation prediction and about 27% on two-step completion.
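To make the two diagnostic tasks concrete, the following is a minimal sketch in Python; the trace format, field names (`op`, `formula`), and helper functions are illustrative assumptions, not the actual FOL-Traces schema.

```python
# Minimal sketch of the two diagnostic tasks, assuming a hypothetical
# trace format (the actual FOL-Traces schema may differ). Each trace is
# a list of steps; every step names the inference rule applied and the
# formula it derives.

trace = [
    {"op": "premise",        "formula": "forall x. P(x) -> Q(x)"},
    {"op": "premise",        "formula": "P(a)"},
    {"op": "instantiation",  "formula": "P(a) -> Q(a)"},  # from step 1
    {"op": "modus_ponens",   "formula": "Q(a)"},          # from steps 2, 3
]

def mask_operation(trace, i):
    """Masked operation prediction: hide the rule name at step i;
    the model must recover it from the surrounding formulas."""
    masked = [dict(step) for step in trace]
    gold = masked[i]["op"]
    masked[i]["op"] = "<MASK>"
    return masked, gold

def step_completion(trace, k):
    """Step completion: drop the final k steps; the model must
    generate them, and candidates can be checked programmatically."""
    return trace[:-k], trace[-k:]

masked_trace, gold_op = mask_operation(trace, 3)  # gold_op == "modus_ponens"
prefix, target_steps = step_completion(trace, 2)  # two-step completion
```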