Defect prediction is a popular research topic in which machine learning (ML) and deep learning (DL) have found numerous applications. However, ML/DL-based defect prediction models are often limited by the quality and size of their datasets. In this paper, we present Defectors, a large dataset for just-in-time and line-level defect prediction. Defectors consists of $\approx$ 213K source code files ($\approx$ 93K defective and $\approx$ 120K defect-free) spanning 24 popular Python projects. These projects come from 18 different domains, including machine learning, automation, and internet-of-things. This scale and diversity make Defectors a suitable dataset for training ML/DL models, especially transformer models, which require large and diverse datasets. We also foresee several application areas for our dataset, including defect prediction and defect explanation. Dataset link: https://doi.org/10.5281/zenodo.7708984