Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis

Test oracle generation for non-regression testing is a longstanding challenge in software engineering: the goal is to produce oracles that accurately determine whether a function under test (FUT) behaves as intended for a given input. In this paper, we introduce Nexus, a novel multi-agent framework that addresses this challenge. Nexus synthesizes test oracles with a diverse set of specialized agents through a structured process of deliberation, validation, and iterative self-refinement. During the deliberation phase, a panel of four specialist agents, each embodying a distinct testing philosophy, collaboratively critiques and refines an initial set of test oracles. Then, in the validation phase, Nexus generates a plausible candidate implementation of the FUT and executes the proposed oracles against it in a secure sandbox. For any oracle that fails this execution-based check, Nexus activates an automated self-refinement loop, using the specific runtime error to debug and correct the oracle before re-validation. Our extensive evaluation on seven diverse benchmarks demonstrates that Nexus consistently and substantially outperforms state-of-the-art baselines. For instance, Nexus improves test-level oracle accuracy on LiveCodeBench from 46.30% to 57.73% for GPT-4.1-Mini. The improved accuracy also significantly enhances downstream tasks: on HumanEval, the bug detection rate of test oracles generated by GPT-4.1-Mini increases from 90.91% with baselines to 95.45% with Nexus, and the success rate of automated program repair improves from 35.23% to 69.32%.
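
The deliberate, validate, and refine flow described in the abstract can be illustrated with a minimal Python sketch. It is not the Nexus implementation. The names `run_oracle`, `synthesize`, `Critic`, and `Refiner` are hypothetical, the agents are stood in for by plain callables rather than LLM calls, and `exec()` is used only as a placeholder for the secure sandbox (it provides no isolation). The abstract does not say what happens to an oracle that still fails after refinement, so the sketch simply reports the final error alongside it.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical roles: in the paper these are LLM agents, here plain callables.
Critic = Callable[[List[str]], List[str]]  # revises the current oracle set
Refiner = Callable[[str, str], str]        # (failing oracle, error) -> revised oracle


def run_oracle(candidate_src: str, oracle: str) -> Optional[str]:
    """Run one assert-style oracle against a candidate FUT implementation.

    Returns None if the oracle passes, otherwise a short error description.
    NOTE: exec() is NOT a secure sandbox; it only stands in for one here.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate implementation
        exec(oracle, namespace)         # execute the oracle against it
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def synthesize(
    initial: List[str],
    critics: List[Critic],
    candidate_src: str,
    refine: Refiner,
    max_rounds: int = 2,  # assumed refinement budget, not taken from the paper
) -> List[Tuple[str, Optional[str]]]:
    """Return (oracle, final_error) pairs; final_error is None for validated oracles."""
    # Deliberation: each specialist critiques and revises the oracle set in turn.
    oracles = list(initial)
    for critic in critics:
        oracles = critic(oracles)

    results = []
    for oracle in oracles:
        # Validation: execute the oracle against the candidate implementation.
        error = run_oracle(candidate_src, oracle)
        rounds = 0
        # Self-refinement: feed the runtime error back, then re-validate.
        while error is not None and rounds < max_rounds:
            oracle = refine(oracle, error)
            error = run_oracle(candidate_src, oracle)
            rounds += 1
        results.append((oracle, error))
    return results


# Toy usage with stand-in agents.
candidate = "def add(a, b):\n    return a + b\n"
initial_oracles = ["assert add(1, 2) == 3", "assert add(2, 2) == 5"]
edge_case_critic = lambda oracles: oracles + ["assert add(0, 0) == 0"]
toy_refiner = lambda oracle, error: oracle.replace("== 5", "== 4")

for oracle, error in synthesize(initial_oracles, [edge_case_critic], candidate, toy_refiner):
    print(oracle, "->", "ok" if error is None else error)
```

In this toy run the critic appends an edge-case oracle, and the oracle with the wrong expected value fails validation, is rewritten by the stand-in refiner, and is then re-validated. A real system would replace each callable with a model call and the `exec()` calls with an isolated execution environment.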