This paper proposes a supervised machine learning approach for predicting the root cause of a given bug report. Knowing the root cause of a bug can help developers during debugging, either directly or indirectly by guiding the choice of suitable tool support for the debugging task. We mined 54,755 closed bug reports from the issue trackers of 103 GitHub projects and applied a set of heuristics to create a benchmark consisting of 10,459 reports. A subset was manually classified into three groups (semantic, memory, and concurrency) according to each bug's root cause. Because the root cause types are not equally distributed, a combination of keyword search and random selection was used to select reports. Our data set for the machine learning approach consists of 369 bug reports (122 concurrency, 121 memory, and 126 semantic bugs). The bug reports are used as input to a natural language processing algorithm. We evaluated the performance of several classifiers for predicting the root causes of the given bug reports. Linear support vector machines achieved the highest mean precision (0.74) and recall (0.72) scores. The created bug data set and classification are publicly available.
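To make the reported metrics concrete, the following is a minimal sketch of how a mean precision and recall over the three root cause classes could be computed. The abstract does not state how the mean is taken, so the sketch assumes an unweighted (macro) average over the classes. The function name `macro_precision_recall` and the toy labels are hypothetical and are not taken from the data set.

```python
from collections import Counter

# The three root cause groups named in the abstract.
LABELS = ("semantic", "memory", "concurrency")


def macro_precision_recall(y_true, y_pred, labels=LABELS):
    """Return (mean precision, mean recall), averaged equally over labels.

    Assumption: "mean" is read as a macro average over the classes.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    precisions, recalls = [], []
    for label in labels:
        predicted = tp[label] + fp[label]  # reports predicted as this class
        actual = tp[label] + fn[label]     # reports truly in this class
        precisions.append(tp[label] / predicted if predicted else 0.0)
        recalls.append(tp[label] / actual if actual else 0.0)

    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy example with made-up labels (not results from the paper).
y_true = ["semantic", "semantic", "memory", "memory", "concurrency", "concurrency"]
y_pred = ["semantic", "memory", "memory", "memory", "concurrency", "semantic"]
mean_precision, mean_recall = macro_precision_recall(y_true, y_pred)
```

Because the three classes in the data set are nearly balanced (122, 121, and 126 reports), a macro average and a support-weighted average would give similar values, but the two are not identical in general.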