Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, including programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: we task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 154 coding tasks collected from domain experts, together with a framework for validating and timing LM-synthesized solution code, which is compared against reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner uses a simple, budgeted loop that edits code, compiles and runs it, profiles performance, verifies correctness on tests, and selects the fastest valid version. AlgoTuner achieves an average 1.72x speedup over our reference solvers, which use libraries such as SciPy, scikit-learn, and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead favoring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.
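The budgeted edit-measure-select loop described above can be pictured roughly as follows. This is a minimal sketch, not AlgoTuner's actual implementation: the helpers `propose_edit`, `run_tests`, and `time_solver` are hypothetical placeholders standing in for the agent's LM call, the benchmark's validation harness, and its timing harness, respectively.

```python
def tune(task, reference_time, budget, propose_edit, run_tests, time_solver):
    """Sketch of a budgeted loop: edit a solver, keep the fastest version that passes tests.

    All helper functions are assumed, illustrative interfaces, not the benchmark's API.
    """
    best_code = task.reference_code      # start from the reference implementation
    best_time = reference_time
    history = []                         # feedback shown to the LM at each step

    for _ in range(budget):
        candidate = propose_edit(task, best_code, history)  # LM proposes an edited solver
        passed, report = run_tests(task, candidate)         # verify correctness first
        if not passed:
            history.append((candidate, f"failed tests: {report}"))
            continue
        elapsed = time_solver(task, candidate)               # profile the valid candidate
        history.append((candidate, f"valid, {elapsed:.3f}s vs best {best_time:.3f}s"))
        if elapsed < best_time:                              # keep only the fastest valid version
            best_code, best_time = candidate, elapsed

    speedup = reference_time / best_time   # e.g. the reported 1.72x average across tasks
    return best_code, speedup
```

Under this framing, the agent never needs to prove anything about the algorithm it writes: correctness is established empirically on the task's test cases, and only timing separates a surface-level tweak from a genuine algorithmic improvement.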