Toward a consistent performance evaluation for defect prediction models

In the defect prediction community, many defect prediction models have been proposed, and new models are continuously being developed. However, there is no consensus on how to evaluate the performance of a newly proposed model. In this paper, we propose MATTER (a fraMework towArd a consisTenT pErformance compaRison), which makes model performance directly comparable across different studies. We take three actions to build a consistent evaluation framework for defect prediction models. First, we propose a simple and easy-to-use unsupervised baseline model, ONE (glObal baseliNe modEl), to provide "a single point of comparison". Second, we propose using an SQA-effort-aligned threshold setting to make the comparison fair. Third, we suggest reporting evaluation results in a unified way and provide a set of core performance indicators for this purpose, so that results can be compared across studies and real progress can be identified. The experimental results show that MATTER can serve as an effective framework to support a consistent performance evaluation for defect prediction models. It can therefore help determine whether a newly proposed defect prediction model is useful to practitioners and indicate whether real progress is being made in defect prediction. Furthermore, when applying MATTER to evaluate representative defect prediction models proposed in recent years, we find that most of them (if not all) are not superior to the simple baseline model ONE in terms of SQA-effort-aware prediction performance. This reveals that the real progress in defect prediction has been overestimated. We hence recommend that, in future studies, whenever a new defect prediction model is proposed, MATTER be used to evaluate its actual usefulness (on the same benchmark test data sets) to advance scientific progress in defect prediction.
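The idea of comparing models under an aligned SQA effort can be illustrated with a minimal sketch. The abstract does not define the threshold rule, so the rule used here is an assumption: each model flags modules in descending order of predicted risk until a fixed fraction of the total lines of code is reached, and recall is then computed on the flagged set. The function names (`flag_within_effort`, `recall`), the `effort_fraction` parameter, and the use of lines of code as the effort measure are all hypothetical and are not taken from MATTER.

```python
from typing import List, Sequence


def flag_within_effort(
    scores: Sequence[float],
    loc: Sequence[int],
    effort_fraction: float = 0.2,
) -> List[bool]:
    """Flag modules in descending score order while the LOC budget allows.

    Assumption: inspection effort is approximated by lines of code, and the
    budget is a fixed fraction of the total LOC of the test set.
    """
    if len(scores) != len(loc):
        raise ValueError("scores and loc must have the same length")

    budget = effort_fraction * sum(loc)
    # Indices of modules ordered from highest to lowest predicted risk.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    flagged = [False] * len(scores)
    spent = 0
    for i in order:
        if spent + loc[i] > budget:
            break  # The next module would exceed the shared effort budget.
        flagged[i] = True
        spent += loc[i]
    return flagged


def recall(flagged: Sequence[bool], defective: Sequence[int]) -> float:
    """Fraction of truly defective modules that were flagged."""
    total = sum(1 for d in defective if d)
    if total == 0:
        return 0.0
    hits = sum(1 for f, d in zip(flagged, defective) if f and d)
    return hits / total
```

Because every model is cut off at the same effort budget, the resulting recall values refer to the same inspection cost, which is what makes them comparable between a new model and a baseline under this assumed rule.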
