An extensive empirical study of inconsistent labels in multi-version-project defect data sets

The label quality of defect data sets directly influences the reliability of defect prediction models. In this study, for multi-version-project defect data sets, we propose an approach to automatically detect instances with inconsistent labels (i.e., the phenomenon of instances having the same source code but different labels over multiple versions of a software project) and to understand their influence on the evaluation and interpretation of defect prediction models. Based on five multi-version-project defect data sets (either widely used or the most up-to-date in the literature) collected by diverse approaches, we find that: (1) most versions in the investigated defect data sets contain inconsistent labels to varying degrees; (2) the existence of inconsistent labels in a training data set may considerably change the prediction performance of a defect prediction model and can lead to the identification of substantially different true defective modules; and (3) the importance ranking of independent variables in a defect prediction model can be substantially shifted due to the existence of inconsistent labels. These findings reveal that inconsistent labels in defect data sets can profoundly change the prediction ability and interpretation of a defect prediction model. Therefore, we strongly suggest that practitioners detect and exclude inconsistent labels in defect data sets to avoid their potential negative influence on defect prediction models. Moreover, researchers need to improve existing defect label collection approaches to reduce inconsistent labels. Furthermore, there is a need to re-examine the experimental conclusions of previous studies that used multi-version-project defect data sets with a high ratio of inconsistent labels.
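The definition of an inconsistent label given in the abstract (same source code, different labels across versions) can be illustrated with a minimal sketch. This is not the authors' detection approach: it assumes that "same source code" means byte-identical text for the same module, and the function name `find_inconsistent_labels` and the tuple layout `(version, module, source, label)` are hypothetical choices made only for illustration.

```python
import hashlib
from collections import defaultdict


def find_inconsistent_labels(instances):
    """Return (module, records) pairs whose identical source has mixed labels.

    `instances` is an iterable of (version, module, source, label) tuples.
    Assumption: two instances share "the same source code" when they belong
    to the same module and their source text is byte-identical.
    """
    groups = defaultdict(list)
    for version, module, source, label in instances:
        # Hash the source so that groups are keyed by content, not by version.
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        groups[(module, digest)].append((version, label))

    inconsistent = []
    for (module, _digest), records in groups.items():
        labels = {label for _version, label in records}
        if len(labels) > 1:  # identical code, but both clean and defective
            inconsistent.append((module, sorted(records)))
    return sorted(inconsistent)


# Hypothetical toy data: Foo.java is unchanged between versions but its
# label flips; Bar.java changes, so its differing labels are not flagged.
data = [
    ("1.0", "Foo.java", "class Foo {}", True),
    ("1.1", "Foo.java", "class Foo {}", False),
    ("1.0", "Bar.java", "class Bar {}", False),
    ("1.1", "Bar.java", "class Bar { int x; }", True),
]
flagged = find_inconsistent_labels(data)
```

In this toy input only `Foo.java` is flagged. A practitioner following the abstract's recommendation could then drop the flagged instances from the training data before fitting a model; how to treat whitespace or comment-only changes is left open here.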

