ReplicationBench：AI智能体能否复现天体物理学研究论文？ (ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?)

Christine Ye,Sihan Yuan,Suchetha Cooray,Steven Dillmann,Ian L. V. Roque,Dalya Baron,Philipp Frank,Sergio Martin-Alvarez,Nolan Koblischke,Frank J Qu,Diyi Yang,Risa Wechsler,Ioana Ciuca

Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.

翻译：前沿AI智能体作为科学研究助手的潜力日益显现，未来或可应用于扩展性、开放式的科研工作流。然而，为将智能体用于创新性研究，我们首先需要评估其工作的底层忠实性与正确性。为评估智能体作为科研助手的能力，我们提出ReplicationBench评估框架，该框架测试智能体能否完整复现天体物理学文献中的研究论文。天体物理学研究高度依赖档案数据与计算研究，且几乎无需现实实验，这使其成为AI智能体在科学研究中的理想测试场。我们将每篇论文拆解为多个任务，要求智能体复现论文的核心贡献，包括实验设置、公式推导、数据分析和代码库。每个任务均与论文原作者共同设计，并针对关键科学成果，从而实现对忠实性（遵循原始方法）与正确性（结果的技术准确性）的客观评估。ReplicationBench对当前前沿语言模型极具挑战性：即使表现最优的语言模型得分也低于20%。我们与领域专家合作分析ReplicationBench的任务轨迹，发现智能体在科学研究中存在丰富多样的失败模式。ReplicationBench建立了首个经专家验证的、论文尺度的天体物理研究任务基准，揭示了可推广至其他数据驱动科学领域的智能体性能洞察，并为衡量AI智能体在科学研究中的可靠性提供了可扩展的框架。