As LLMs advance in their ability to reason about the physical world, the absence of rigorous benchmarks for evaluating their ability to generate scientifically valid physical models has become a critical gap. Computational mechanics, which develops and applies mathematical models and numerical methods to predict the behavior of physical systems under forces, deformation, and constraints, provides an ideal foundation for structured evaluation of scientific reasoning. Its problems follow a clear mathematical structure, enforce strict physical and numerical constraints, and support objective verification. The discipline requires constructing explicit models of physical systems and reasoning about geometry, spatial relationships, and material behavior, connecting directly to emerging AI goals in physical reasoning and world modeling. We introduce FEM-Bench, a computational mechanics benchmark designed to evaluate the ability of LLMs to generate correct finite element method (FEM) and related code. FEM-Bench 2025 contains a suite of introductory but nontrivial tasks aligned with material from a first graduate course on computational mechanics. These tasks capture essential numerical and physical modeling challenges while representing only a small fraction of the complexity present in the discipline. Despite their simplicity, state-of-the-art LLMs do not reliably solve all of them. In a five-attempt run, the best-performing model at function writing, Gemini 3 Pro, completed 30/33 tasks at least once and 26/33 tasks all five times. The best-performing model at unit-test writing, GPT-5, achieved an Average Joint Success Rate of 73.8%. Other popular models showed wide variation in performance. FEM-Bench establishes a structured foundation for evaluating AI-generated scientific code, and future iterations will incorporate increasingly sophisticated tasks to track progress as models evolve.
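To make concrete the kind of introductory FEM function-writing task the abstract describes, the sketch below shows a representative (hypothetical) example: computing the stiffness matrix of a 1D two-node linear bar element and solving a fixed-free bar under a tip load. The function names, signatures, and parameters are illustrative assumptions, not tasks drawn from FEM-Bench itself, and Python is assumed only because it is a common choice for such benchmarks.

```python
import numpy as np


def linear_bar_stiffness(E: float, A: float, L: float) -> np.ndarray:
    """Stiffness matrix of a 2-node linear bar element (axial deformation only).

    k = (E*A/L) * [[ 1, -1],
                   [-1,  1]]
    """
    if L <= 0:
        raise ValueError("Element length must be positive.")
    return (E * A / L) * np.array([[1.0, -1.0], [-1.0, 1.0]])


def solve_fixed_free_bar(E: float, A: float, L: float,
                         n_elems: int, tip_load: float) -> np.ndarray:
    """Assemble a uniform fixed-free bar discretized into n_elems elements
    and solve for nodal displacements under an axial tip load."""
    n_nodes = n_elems + 1
    K = np.zeros((n_nodes, n_nodes))
    le = L / n_elems  # uniform element length
    for e in range(n_elems):
        ke = linear_bar_stiffness(E, A, le)
        dofs = [e, e + 1]
        K[np.ix_(dofs, dofs)] += ke  # scatter element matrix into global matrix
    F = np.zeros(n_nodes)
    F[-1] = tip_load
    # Enforce the fixed support at node 0 by solving the reduced system.
    u = np.zeros(n_nodes)
    u[1:] = np.linalg.solve(K[1:, 1:], F[1:])
    return u


if __name__ == "__main__":
    u = solve_fixed_free_bar(E=210e9, A=1e-4, L=2.0, n_elems=4, tip_load=1e3)
    # For this problem, linear elements reproduce the exact tip displacement P*L/(E*A),
    # which is the sort of objective check a benchmark's unit tests can verify.
    assert np.isclose(u[-1], 1e3 * 2.0 / (210e9 * 1e-4))
    print(u)
```

A task of this kind illustrates the properties the abstract attributes to computational mechanics: the required output has a precise mathematical definition, physical constraints (symmetry, positive semi-definiteness, boundary conditions) must be respected, and correctness can be verified objectively against a closed-form solution.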