Over the past couple of decades, significant research effort has been devoted to predicting software bugs (i.e., defects). This work leverages a diverse set of metrics, tools, and techniques to predict which classes, methods, lines, or commits are buggy. However, most existing work in this domain treats all bugs the same, which is not the case in practice: the more severe a bug, the greater its consequences. It is therefore important for a defect prediction method to estimate the severity of the identified bugs, so that higher-severity ones get immediate attention. In this paper, we present a quantitative and qualitative study on two popular datasets (Defects4J and Bugs.jar), using 10 common source code metrics and two popular static analysis tools (SpotBugs and Infer), to analyze their capability in predicting defects and their severity. We studied 3,358 buggy methods with different severity labels from 19 open-source Java projects. Results show that although code metrics are powerful in predicting buggy code (Lines of Code, Maintainability Index, FanOut, and Effort are the best-performing metrics), they cannot estimate the severity level of the bugs. In addition, we observed that static analysis tools perform weakly in predicting both bugs (F1 scores of 3.1%-7.1%) and their severity labels (F1 score under 2%). We also manually studied the characteristics of the severe bugs to identify possible reasons behind the weak performance of code metrics and static analysis tools in estimating severity. Our categorization shows that Security bugs have high severity in most cases, while Edge/Boundary faults have low severity. Finally, we show that code metrics and static analysis methods can be complementary in terms of estimating bug severity.