Context: Software engineering researchers have undertaken many experiments investigating the potential of software defect prediction algorithms. Unfortunately, some widely used performance metrics are known to be problematic, most notably F1, which nevertheless remains in common use. Objective: To investigate the potential impact of using F1 on the validity of this large body of research. Method: We undertook a systematic review to locate relevant experiments and then extracted all pairwise comparisons of defect prediction performance that could be made using both F1 and the unbiased Matthews correlation coefficient (MCC). Results: We found a total of 38 primary studies. These contain 12,471 pairs of results. Of these, 21.95% change direction when MCC is used instead of the biased F1 metric. Unfortunately, we also found evidence suggesting that F1 remains widely used in software defect prediction research. Conclusions: We reiterate the concerns of statisticians that F1 is a problematic metric outside an information retrieval context: it takes no account of true negatives, whereas in defect prediction we are concerned about both classes (defect-prone and not defect-prone units). This inappropriate usage has led to a substantial number (more than one fifth) of results that are erroneous in terms of direction. We therefore urge researchers to (i) use an unbiased metric and (ii) publish detailed results, including confusion matrices, so that alternative analyses become possible.
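To make the notion of a pairwise comparison "changing direction" concrete, the following minimal sketch computes F1 and MCC from confusion matrices and checks whether the two metrics rank a pair of classifiers differently. The function names (`f1`, `mcc`, `direction_changes`) and the two confusion matrices are hypothetical and chosen purely for illustration; they are not taken from the primary studies, and this is not the analysis procedure used in the review. Returning 0.0 when a denominator is zero is an assumed convention.

```python
from math import sqrt
from typing import NamedTuple


class Confusion(NamedTuple):
    """Confusion matrix counts; 'positive' means defect-prone."""
    tp: int
    fp: int
    fn: int
    tn: int


def f1(c: Confusion) -> float:
    """F1 = 2TP / (2TP + FP + FN). Note that TN does not appear."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0  # assumed convention for 0/0


def mcc(c: Confusion) -> float:
    """Matthews correlation coefficient, using all four cells."""
    denom = sqrt((c.tp + c.fp) * (c.tp + c.fn) * (c.tn + c.fp) * (c.tn + c.fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (c.tp * c.tn - c.fp * c.fn) / denom


def direction_changes(a: Confusion, b: Confusion) -> bool:
    """True if F1 and MCC disagree about which of a, b performs better."""
    f1_diff = f1(a) - f1(b)
    mcc_diff = mcc(a) - mcc(b)
    return f1_diff * mcc_diff < 0


# Illustrative (made-up) results on a data set with 80 defect-prone
# and 20 not defect-prone units.
# Classifier A labels almost everything as defect-prone.
clf_a = Confusion(tp=80, fp=19, fn=0, tn=1)
# Classifier B discriminates between the two classes far better.
clf_b = Confusion(tp=60, fp=2, fn=20, tn=18)

if __name__ == "__main__":
    print(f"A: F1={f1(clf_a):.3f}  MCC={mcc(clf_a):.3f}")
    print(f"B: F1={f1(clf_b):.3f}  MCC={mcc(clf_b):.3f}")
    print("direction changes:", direction_changes(clf_a, clf_b))
```

In this constructed example F1 prefers classifier A, which barely identifies any not defect-prone units, while MCC prefers classifier B. Because both functions need only the four cell counts, the comparison can be redone by anyone when the confusion matrices are published.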