Multi-Agent LLM Committees for Autonomous Software Beta Testing

Manual software beta testing is costly and time-consuming, while single-agent large language model (LLM) approaches suffer from hallucinations and inconsistent behavior. We propose a multi-agent committee framework in which diverse vision-enabled LLMs collaborate through a three-round voting protocol to reach consensus on testing actions. The framework combines model diversity, persona-driven behavioral variation, and visual user interface understanding to systematically explore web applications. Across 84 experimental runs with 9 testing personas and 4 scenarios, multi-agent committees achieve an 89.5 percent overall task success rate. Configurations with 2 to 4 agents reach 91.7 to 100 percent success, compared to 78.0 percent for single-agent baselines, yielding improvements of 13.7 to 22.0 percentage points. At the action level, the system attains a 93.1 percent success rate with a median per-action latency of 0.71 seconds, enabling real-time and continuous integration testing. Vision-enabled agents successfully identify user interface elements, with navigation and reporting achieving 100 percent success and form filling achieving 99.2 percent success. We evaluate the framework on WebShop and OWASP benchmarks, achieving 74.7 percent success on WebShop compared to a 50.1 percent published GPT-3 baseline, and 82.0 percent success on OWASP Juice Shop security testing with coverage of 8 of the 10 OWASP Top 10 vulnerability categories. Across 20 injected regressions, the committee achieves an F1 score of 0.91 for bug detection, compared to 0.78 for single-agent baselines. The open-source implementation enables reproducible research and practical deployment of LLM-based software testing in CI/CD pipelines.
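The abstract names a three-round voting protocol but does not spell out its rules, so the following is only a minimal sketch of how such a consensus loop could look, not the authors' implementation. Everything in it is an assumption made for illustration: the `Agent` callable interface, the `committee_decide` function, the strict-majority stopping rule, and the plurality fallback with alphabetical tie-breaking.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical agent interface (an assumption, not from the paper): given the
# page observation and the actions voted in the previous round (empty in the
# first round), the agent returns one proposed action as a string.
Agent = Callable[[str, Sequence[str]], str]


def committee_decide(
    agents: Sequence[Agent], observation: str, rounds: int = 3
) -> Optional[str]:
    """Return the action the committee settles on, or None if there are no agents."""
    if not agents:
        return None
    previous: Sequence[str] = ()
    for _ in range(rounds):
        # Every agent sees the previous round's votes and may revise its choice.
        votes = [agent(observation, previous) for agent in agents]
        action, count = Counter(votes).most_common(1)[0]
        if count * 2 > len(agents):
            # Assumed stopping rule: a strict majority ends the protocol early.
            return action
        previous = tuple(votes)
    # Assumed fallback after the last round: plurality of the final votes,
    # with ties broken alphabetically so the result is deterministic.
    tally = Counter(previous)
    best = max(tally.values())
    return min(a for a, c in tally.items() if c == best)
```

In this sketch, three agents that open with three different actions reach no majority in the first round. If one of them switches to a peer's action after seeing the first-round votes, the second round produces a two-of-three majority and the loop stops before the third round.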
