Test smells reduce test suite reliability and complicate maintenance. While many methods detect test smells, few support automated removal, and most rely on static analysis or machine learning. This study evaluates models with relatively small parameter counts (Llama-3.2-3B, Gemma-2-9B, DeepSeek-R1-14B, and Phi-4-14B) for their ability to detect and refactor test smells using agent-based workflows. We assess workflows with one, two, and four agents on 150 instances of five common smells drawn from real-world Java projects; the approach also generalizes to Python, Golang, and JavaScript. All models detected nearly all instances, and Phi-4-14B achieved the best refactoring accuracy (pass@5 of 75.3%). With four agents, Phi-4-14B performed within 5% of single-agent proprietary LLMs. Multi-agent setups outperformed single-agent ones for three of the five smell types, though for Assertion Roulette a single agent sufficed. We submitted pull requests with Phi-4-14B-generated code to open-source projects, and six were merged.
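The abstract reports refactoring accuracy as pass@5. The source does not show the computation; assuming the standard unbiased pass@k estimator of Chen et al. (2021), where n candidate refactorings are sampled per instance and c of them pass, a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn without replacement from
    n generations of which c are correct, passes.
    Assumed formulation -- the paper's exact protocol may differ."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k must
        # include at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical instance: 10 generations, 3 correct refactorings.
print(f"{pass_at_k(10, 3, 5):.4f}")  # chance that a batch of 5 contains a pass

# Degenerate cases when k equals n (as with pass@5 over 5 samples):
print(pass_at_k(5, 0, 5))  # no correct sample -> 0.0
print(pass_at_k(5, 1, 5))  # any correct sample -> 1.0
```

The per-instance estimates are then averaged over all 150 smell instances to obtain a suite-level score such as the reported 75.3%.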