Software fault-proneness prediction is an active research area, and many factors that affect prediction performance have been studied extensively. However, the impact of the learning approach (i.e., the specifics of the data used for training and of the target variable being predicted) on prediction performance has received almost no attention, with the exception of one initial work. This paper explores the effects of two learning approaches, useAllPredictAll and usePrePredictPost, on the performance of software fault-proneness prediction, both within-release and across-releases. The empirical results are based on data extracted from 64 releases of twelve open-source projects. The results show that the learning approach has a substantial, and typically unacknowledged, impact on classification performance. Specifically, useAllPredictAll leads to significantly better performance than usePrePredictPost, both within-release and across-releases. Furthermore, this paper uncovers that, for within-release predictions, the difference in classification performance is due to the different levels of class imbalance under the two learning approaches. When class imbalance is addressed, the performance difference between the learning approaches is eliminated. Our findings imply that the learning approach should always be explicitly identified and its impact on software fault-proneness prediction accounted for. The paper concludes with a discussion of the potential consequences of our results for both research and practice.
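For concreteness, the sketch below shows one plausible way to operationalize the two learning approaches in a within-release setting, together with a simple class-imbalance remedy. It is illustrative only: the synthetic data, the column names (pre_release_faulty, post_release_faulty), the single train/test split, the choice of classifier, and the class_weight-based rebalancing are all our assumptions, not the paper's actual protocol.

```python
# Hypothetical sketch of the two learning approaches; all names and data
# are illustrative assumptions, not the paper's dataset or protocol.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_files = 500

# Synthetic per-file data for one release: static metrics plus
# pre-release and post-release fault indicators (post-release faults
# are much rarer, which drives the class imbalance the paper discusses).
release = pd.DataFrame({
    "loc": rng.integers(10, 2000, n_files),
    "complexity": rng.integers(1, 50, n_files),
    "churn": rng.integers(0, 300, n_files),
    "pre_release_faulty": (rng.random(n_files) < 0.30).astype(int),
    "post_release_faulty": (rng.random(n_files) < 0.10).astype(int),
})
metrics = ["loc", "complexity", "churn"]

# A single 70/30 split stands in for the paper's within-release setup,
# which may instead use cross-validation.
train_idx, test_idx = train_test_split(release.index, test_size=0.3,
                                       random_state=0)

def evaluate(clf, y_train, y_test):
    """Fit on training labels, score AUC against (possibly different) test labels."""
    clf.fit(release.loc[train_idx, metrics], y_train.loc[train_idx])
    scores = clf.predict_proba(release.loc[test_idx, metrics])[:, 1]
    return roc_auc_score(y_test.loc[test_idx], scores)

# useAllPredictAll: train and test labels mark files faulty in ANY phase.
y_all = release["pre_release_faulty"] | release["post_release_faulty"]
auc_all = evaluate(RandomForestClassifier(random_state=0), y_all, y_all)

# usePrePredictPost: train on pre-release fault labels, evaluate against
# the much rarer post-release fault labels.
auc_pre_post = evaluate(RandomForestClassifier(random_state=0),
                        release["pre_release_faulty"],
                        release["post_release_faulty"])

# One common remedy for class imbalance: reweight classes during training.
auc_balanced = evaluate(
    RandomForestClassifier(random_state=0, class_weight="balanced"),
    release["pre_release_faulty"], release["post_release_faulty"])

print(f"useAllPredictAll  AUC: {auc_all:.3f}")
print(f"usePrePredictPost AUC: {auc_pre_post:.3f} (balanced: {auc_balanced:.3f})")
```

Because the synthetic features are pure noise, the printed AUC values hover around 0.5; the point of the sketch is only the differing construction of the training and target labels under the two learning approaches, and the far rarer positive class under usePrePredictPost.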