Software Defect Prediction (SDP) models are central to proactive software quality assurance, yet their effectiveness is often constrained by the quality of available datasets. Prior research has typically examined single issues such as class imbalance or feature irrelevance in isolation, overlooking that real-world data problems frequently co-occur and interact. This study presents, to our knowledge, the first large-scale empirical analysis in SDP that simultaneously examines five co-occurring data quality issues (class imbalance, class overlap, irrelevant features, attribute noise, and outliers) across 374 datasets and five classifiers. We employ Explainable Boosting Machines together with stratified interaction analysis to quantify both direct and conditional effects under default hyperparameter settings, reflecting practical baseline usage. Our results show that co-occurrence is nearly universal: even the least frequent issue (attribute noise) appears alongside others in more than 93% of datasets. Irrelevant features and imbalance are nearly ubiquitous, while class overlap is the most consistently harmful issue. We identify stable tipping points around 0.20 for class overlap, 0.65-0.70 for imbalance, and 0.94 for irrelevance, beyond which most models begin to degrade. We also uncover counterintuitive patterns, such as outliers improving performance when irrelevant features are low, underscoring the importance of context-aware evaluation. Finally, we expose a performance-robustness trade-off: no single learner dominates under all conditions. By jointly analyzing prevalence, co-occurrence, thresholds, and conditional effects, our study directly addresses a persistent gap in SDP research, moving beyond isolated analyses to provide a holistic, data-aware understanding of how quality issues shape model performance in real-world settings.
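To make the imbalance tipping point concrete, the sketch below computes a simple majority-class ratio for a labeled dataset. This is an illustrative measure only; the abstract does not specify the paper's exact imbalance metric, so the definition used here (majority-class fraction) is an assumption.

```python
import numpy as np

def imbalance_ratio(y):
    """Fraction of samples belonging to the majority class.

    Hypothetical proxy for dataset imbalance; the study's actual
    metric may differ. Returns a value in (0, 1], where values
    near 1.0 indicate severe imbalance.
    """
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

# Synthetic defect labels: 80 non-defective (0) vs 20 defective (1)
y = np.array([0] * 80 + [1] * 20)
ratio = imbalance_ratio(y)
print(ratio)  # 0.8 -- above the 0.65-0.70 tipping point reported above
```

Under this reading, a dataset with an 80/20 class split already sits past the 0.65-0.70 range where most models in the study begin to degrade, which is common in real defect datasets where defective modules are the minority.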

