Much of the reported progress in file-level software defect prediction (SDP) is, in reality, an illusion of accuracy. Over recent decades, machine learning and deep learning models have reported steadily increasing performance across software versions. However, because most files persist across releases and retain their defect labels, standard evaluation rewards the exploitation of label-persistence bias rather than genuine reasoning about code changes. To address this issue, we reformulate SDP as a change-aware prediction task, in which models reason over the code changes of a file between successive project versions rather than relying on static file snapshots. Building on this formulation, we propose an LLM-driven, change-aware, multi-agent debate framework. Our experiments on multiple PROMISE projects show that traditional models achieve inflated F1 scores while failing on rare but critical defect-transition cases. In contrast, our change-aware reasoning and multi-agent debate framework yields more balanced performance across evolution subsets and significantly improves sensitivity to defect introductions. These results highlight fundamental flaws in current SDP evaluation practices and underscore the need for change-aware reasoning in practical defect prediction. The source code is publicly available.