We introduce AInsteinBench, a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents within real research software ecosystems. Unlike existing scientific reasoning benchmarks, which focus on conceptual knowledge, or software engineering benchmarks, which emphasize generic feature implementation and issue resolution, AInsteinBench evaluates models in end-to-end scientific development settings grounded in production-grade scientific repositories. The benchmark consists of tasks derived from maintainer-authored pull requests across six widely used scientific codebases, spanning quantum chemistry, quantum computing, molecular dynamics, numerical relativity, fluid dynamics, and cheminformatics. All benchmark tasks are carefully curated through multi-stage filtering and expert review to ensure scientific challenge, adequate test coverage, and well-calibrated difficulty. By combining evaluation in executable environments, scientifically meaningful failure modes, and test-driven verification, AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.
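The test-driven verification described above can be pictured as a patch-apply-and-test loop over a repository checkout. The sketch below is a minimal illustration under assumed conventions: the Task record, the verify_patch helper, the directory layout, and the use of pytest are all hypothetical and are not the benchmark's actual harness.

```python
# Minimal sketch of test-driven verification for a single benchmark task.
# Assumes each task provides a repository checkout at the base commit, a
# model-generated unified diff, and the maintainer tests that must pass.
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Task:
    repo_dir: Path          # checkout of the scientific repository at the base commit
    patch_file: Path        # candidate patch produced by the model under evaluation
    test_ids: list[str]     # reference tests that must pass after the patch


def verify_patch(task: Task, timeout: int = 1800) -> bool:
    """Apply the candidate patch and run the task's reference tests in-place."""
    # A patch that fails to apply counts as a failed task.
    applied = subprocess.run(
        ["git", "apply", str(task.patch_file)],
        cwd=task.repo_dir, capture_output=True,
    )
    if applied.returncode != 0:
        return False

    # Run only the tests tied to this task; their pass/fail status is the outcome.
    result = subprocess.run(
        ["python", "-m", "pytest", "-x", *task.test_ids],
        cwd=task.repo_dir, capture_output=True, timeout=timeout,
    )
    return result.returncode == 0
```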