Defect prediction has been a popular research topic where machine learning (ML) and deep learning (DL) have found numerous applications. However, these ML/DL-based defect prediction models are often limited by the quality and size of their datasets. In this paper, we present Defectors, a large dataset for just-in-time and line-level defect prediction. Defectors consists of $\approx$213K source code files ($\approx$93K defective and $\approx$120K defect-free) that span 24 popular Python projects. These projects come from 18 different domains, including machine learning, automation, and internet-of-things. Such scale and diversity make Defectors a suitable dataset for training ML/DL models, especially transformer models that require large and diverse datasets. We also foresee several application areas of our dataset, including defect prediction and defect explanation. Dataset link: https://doi.org/10.5281/zenodo.7708984