Benchmarks are essential for unified evaluation and reproducibility. The rapid rise of Artificial Intelligence for Software Engineering (AI4SE) has produced numerous benchmarks for tasks such as code generation and bug repair. However, this proliferation has led to major challenges: (1) fragmented knowledge across tasks, (2) difficulty in selecting contextually relevant benchmarks, (3) lack of standardization in benchmark creation, and (4) flaws that limit their utility. Addressing these challenges requires a dual approach: systematically mapping existing benchmarks to support informed selection, and defining unified guidelines for robust, adaptable benchmark development. We conduct a review of 247 studies, identifying 273 AI4SE benchmarks published since 2014. We categorize these benchmarks, analyze their limitations, and expose gaps in current practices. Building on these insights, we introduce BenchScout, an extensible semantic search tool for locating suitable benchmarks. BenchScout applies automated clustering to contextual embeddings of benchmark-related studies, followed by dimensionality reduction. In a user study with 22 participants, BenchScout achieved average usability, effectiveness, and intuitiveness scores of 4.5, 4.0, and 4.1 out of 5, respectively. To improve benchmarking standards, we propose BenchFrame, a unified framework for enhancing benchmark quality. Applying BenchFrame to HumanEval yielded HumanEvalNext, featuring corrected errors, improved language conversion, higher test coverage, and greater difficulty. Evaluating 10 state-of-the-art code models on HumanEval, HumanEvalPlus, and HumanEvalNext showed that average pass@1 on HumanEvalNext dropped by 31.22% and 19.94% relative to HumanEval and HumanEvalPlus, respectively, underscoring the need for continuous benchmark refinement. We further examine BenchFrame's scalability through an agentic pipeline and confirm its generalizability on the MBPP dataset. All review data, user study materials, and enhanced benchmarks are publicly released.
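To make the BenchScout pipeline described above concrete, the following is a minimal sketch of an embed-cluster-project workflow over benchmark-related study texts. The embedding model, clustering algorithm, cluster count, and projection method shown here are illustrative assumptions, not BenchScout's actual implementation choices.

```python
# Minimal sketch: contextual embeddings -> automated clustering -> 2D projection.
# Assumes sentence-transformers and scikit-learn; BenchScout's real choices may differ.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Toy stand-ins for abstracts of benchmark-related studies.
studies = [
    "HumanEval: hand-written Python problems for evaluating code generation.",
    "MBPP: mostly basic Python programming problems with test cases.",
    "Defects4J: a database of real Java bugs for automated program repair.",
    "BugsInPy: reproducible Python bugs for repair and fault localization.",
]

# 1. Contextual embeddings of the study texts (assumed model choice).
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(studies)

# 2. Automated clustering in embedding space (assumed algorithm and k).
labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)

# 3. Dimensionality reduction to two dimensions for a browsable benchmark map.
coords = PCA(n_components=2).fit_transform(embeddings)

for study, label, (x, y) in zip(studies, labels, coords):
    print(f"cluster {label} @ ({x:+.2f}, {y:+.2f})  {study[:45]}...")
```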
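For context on the reported pass@1 drops: pass@1 is conventionally estimated per problem from $n$ sampled completions, of which $c$ pass all tests, using the unbiased pass@$k$ estimator popularized alongside HumanEval. The formula below assumes that conventional definition; it is not a restatement of this paper's exact evaluation protocol.

\[
  \text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],
  \qquad
  \text{pass@}1 \;=\; \mathbb{E}_{\text{problems}}\!\left[\, \frac{c}{n} \,\right].
\]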