As large language models (LLMs) become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs' software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models' true capabilities. It is crucial to distinguish LLMs' generalizable problem-solving ability from other learned artifacts. In this work, we introduce two diagnostic tasks that probe models' underlying knowledge: identifying the buggy file path from the issue description alone, and reproducing the ground-truth function given only the issue description and the current file context. We present empirical evidence that performance gains on SWE-Bench Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. On tasks from repositories not included in SWE-Bench, this accuracy drops to at most 53%, pointing to possible data contamination or memorization. A similar pattern holds for the function reproduction task, where verbatim similarity is much higher on SWE-Bench Verified than on other comparable coding benchmarks (up to 35% consecutive 5-gram accuracy on SWE-Bench Verified and Full, versus at most 18% on other benchmarks). These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks to reliably evaluate LLMs' coding abilities.
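To make the verbatim-similarity measure concrete, the sketch below shows one way a consecutive 5-gram accuracy could be computed. It is an illustrative assumption only: the abstract does not specify the tokenization or matching procedure, so this version uses simple whitespace tokenization and reports the fraction of the generated function's consecutive 5-grams that also appear verbatim in the ground-truth function.

```python
# Illustrative sketch, not the paper's exact metric: assumes whitespace
# tokenization and measures how many consecutive 5-grams of the generated
# function occur verbatim in the ground-truth function.

def ngrams(tokens, n=5):
    """Return all consecutive n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def consecutive_5gram_accuracy(generated: str, reference: str, n: int = 5) -> float:
    """Fraction of the generated code's consecutive n-grams found verbatim
    in the reference (ground-truth) code."""
    gen_grams = ngrams(generated.split(), n)
    if not gen_grams:
        return 0.0
    ref_grams = set(ngrams(reference.split(), n))
    matched = sum(1 for g in gen_grams if g in ref_grams)
    return matched / len(gen_grams)

# Example usage: a high score indicates near-verbatim reproduction,
# which is the memorization signal the diagnostic task looks for.
if __name__ == "__main__":
    ref = "def area(r): return 3.14159 * r * r"
    gen = "def area(r): return 3.14159 * r * r  # radius squared"
    print(f"consecutive 5-gram accuracy: {consecutive_5gram_accuracy(gen, ref):.2f}")
```

Under this reading, the reported gap (up to 35% on SWE-Bench Verified and Full versus at most 18% elsewhere) would mean models reproduce substantially more of the exact ground-truth token sequences for SWE-Bench repositories than for comparable tasks outside it.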