Machine learning (ML) has been widely used in the literature to automate software engineering (SE) tasks. However, ML outcomes may be sensitive to randomization in data sampling mechanisms and learning procedures. To understand whether and how researchers in SE address these threats, we surveyed 45 recent papers related to three predictive tasks: defect prediction (DP), predictive mutation testing (PMT), and code smell detection (CSD). We found that fewer than 50% of the surveyed papers address the threats related to randomized data sampling (via multiple repetitions); only 8% of the papers address the random nature of ML; and parameter values are rarely reported (in only 18% of the papers). To assess the severity of these threats, we conducted an empirical study on 26 real-world datasets commonly used for the three predictive tasks of interest, considering eight common supervised ML classifiers. We show that different data resamplings for 10-fold cross-validation lead to extreme variability in the observed performance results. Furthermore, randomized ML methods also show non-negligible variability across different choices of random seed. More worryingly, performance and variability are inconsistent across different implementations of conceptually the same ML method in different libraries, as we also show through a multi-dataset pairwise comparison. To cope with these critical threats, we provide practical guidelines on how to validate, assess, and report the results of predictive methods.
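To make the two sources of randomness concrete, the following minimal sketch (our illustration, not the study's actual pipeline) uses scikit-learn on a synthetic dataset: it repeats 10-fold cross-validation of a random forest, first varying the seed that controls fold assignment and then the seed that controls the learner's internal randomization. The classifier, metric, dataset, and seed ranges are illustrative assumptions.

```python
# Sketch of the two randomness sources: (1) the data-resampling seed of
# 10-fold cross-validation and (2) the random seed of a randomized learner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary-classification data standing in for a real-world dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# (1) Vary the fold-assignment seed while fixing the learner's seed.
resampling_scores = [
    cross_val_score(
        RandomForestClassifier(random_state=0),
        X, y,
        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=seed),
        scoring="f1",
    ).mean()
    for seed in range(10)
]

# (2) Vary the learner's seed while fixing the fold assignment.
fixed_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
learner_scores = [
    cross_val_score(
        RandomForestClassifier(random_state=seed),
        X, y, cv=fixed_cv, scoring="f1",
    ).mean()
    for seed in range(10)
]

# Report spread alongside the mean so the variability is visible.
print("resampling: mean=%.3f std=%.3f"
      % (np.mean(resampling_scores), np.std(resampling_scores)))
print("learner seed: mean=%.3f std=%.3f"
      % (np.mean(learner_scores), np.std(learner_scores)))
```

Reporting the standard deviation over multiple repetitions and seeds, rather than a single cross-validation run, is one way to surface the variability that the survey found most papers leave unaddressed.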