The identification and localization of errors is a core task in peer review, yet the exponential growth of scientific output has made it increasingly difficult for human reviewers to reliably detect errors given the limited pool of experts. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks, from academic peer review to automated scientific assessment. However, despite the growing use of LLMs in review systems, their ability to pinpoint errors remains underexplored. In this work, we introduce Fault Localization Across Writing in Science (FLAWS), an automated benchmark of 713 paper-error pairs designed to evaluate how effectively LLMs detect errors that undermine key claims in research papers. We construct the benchmark by systematically inserting claim-invalidating errors into peer-reviewed papers using LLMs, paired with an automated evaluation metric that measures whether models can identify and localize these errors. Developing such a benchmark presents unique challenges that we overcome: ensuring that the inserted errors are well-defined, challenging, and relevant to the paper's content; avoiding artifacts that would make identification trivial; and designing a scalable, automated evaluation metric. On the resulting benchmark, we evaluate five frontier LLMs: Claude Sonnet 4.5, DeepSeek Reasoner v3.1, Gemini 2.5 Pro, GPT 5, and Grok 4. Among these, GPT 5 is the top-performing model, achieving 39.1% identification accuracy at k=10, where k is the number of top-ranked error-text candidates generated by the LLM.
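To make the reported metric concrete, the following is a minimal sketch of how an identification-accuracy-at-k score could be computed, assuming each benchmark instance pairs a ranked list of model-generated error-text candidates with the inserted error. The matching rule (`spans_match`) and all function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of accuracy@k for a FLAWS-style evaluation.
# The matching rule below is an assumption; the paper's automated
# metric may match candidates to inserted errors differently.

def spans_match(candidate: str, inserted_error: str) -> bool:
    """Toy matching rule: a candidate localizes the error if one span
    contains the other (case-insensitive). A real metric would likely
    be more robust, e.g. token overlap or semantic similarity."""
    c = candidate.strip().lower()
    e = inserted_error.strip().lower()
    return c in e or e in c

def identification_accuracy_at_k(results, k: int = 10) -> float:
    """`results` is a list of (ranked_candidates, inserted_error) pairs,
    one per paper-error pair. A pair counts as identified if any of the
    model's top-k ranked candidates matches the inserted error."""
    hits = sum(
        any(spans_match(c, err) for c in candidates[:k])
        for candidates, err in results
    )
    return hits / len(results)

# Example: one instance where the third-ranked candidate matches.
results = [(
    ["an unrelated span", "another span", "the flawed equation"],
    "the flawed equation in Sec. 3",
)]
print(identification_accuracy_at_k(results, k=10))  # 1.0
```

Under this reading, the headline number means that for 39.1% of paper-error pairs, at least one of GPT 5's ten top-ranked candidates localizes the inserted error.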