Data from open-source software repositories provide valuable information for improving software quality. In software defect prediction, one of the most challenging tasks is extracting valid data about failure-prone software components from these repositories; such data can help develop more robust software. In particular, collecting data, calculating metrics, and synthesizing results is tedious and error-prone, and it often requires understanding the programming languages used in the mined repositories, which has led to a proliferation of language-specific data-mining tools. This paper presents RepoMiner, a language-agnostic tool that helps software engineering researchers create datasets for defect prediction studies. RepoMiner automatically collects failure data from software components, labels the components as failure-prone or neutral, and calculates metrics, so that the resulting data can serve as ground truth for defect prediction models. We present its implementation and provide examples of its application.
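To make the collect, label, and measure steps described in the abstract concrete, the following is a minimal sketch and not RepoMiner's actual implementation. It assumes the commit history has already been extracted into simple records, and it uses a hypothetical keyword heuristic (`FIX_PATTERN`) to decide which commits are fixes. The names `Commit` and `label_components`, and the two metrics shown (change count and fix count), are illustrative assumptions. Because it looks only at commit messages and file paths, the sketch does not depend on the programming language of the mined files.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Hypothetical heuristic: treat a commit as a fix if its message
# contains one of these words. Real tools may use issue trackers instead.
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|defect|fault)\b", re.IGNORECASE)


@dataclass(frozen=True)
class Commit:
    """Illustrative commit record: a message and the paths it touched."""
    message: str
    files: Tuple[str, ...]


def label_components(commits: Iterable[Commit]) -> Dict[str, dict]:
    """Label each touched file and attach two simple process metrics."""
    changes = Counter()  # number of commits touching each file
    fixes = Counter()    # number of fix commits touching each file
    for commit in commits:
        is_fix = bool(FIX_PATTERN.search(commit.message))
        for path in commit.files:
            changes[path] += 1
            if is_fix:
                fixes[path] += 1
    return {
        path: {
            # Assumption: any file touched by a fix commit is failure-prone.
            "label": "failure-prone" if fixes[path] else "neutral",
            "changes": changes[path],
            "fixes": fixes[path],
        }
        for path in changes
    }


# Made-up history used only to exercise the sketch.
history = [
    Commit("Add YAML parser", ("parser.py", "README.md")),
    Commit("Fix crash on empty input", ("parser.py",)),
    Commit("Update docs", ("README.md",)),
]
dataset = label_components(history)
```

In this made-up history, `parser.py` is touched by a fix commit and is labeled failure-prone, while `README.md` is labeled neutral. A real dataset would likely need a more careful fix-identification step than keyword matching, which is prone to false positives.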