Reliable empirical models, such as those used in software effort estimation or defect prediction, are inherently dependent on the data from which they are built. As demands for process and product improvement continue to grow, the quality of the data used in measurement and prediction systems warrants increasingly close scrutiny. In this paper we propose a taxonomy of data quality challenges in empirical software engineering, based on an extensive review of prior research. For each quality issue, we consider current assessment techniques and, where available, the mechanisms proposed to address it. Our taxonomy classifies data quality issues into three broad areas: first, characteristics of data that mean they are not fit for modeling; second, data set characteristics that lead to concerns about the suitability of applying a given model to another data set; and third, factors that prevent or limit data accessibility and trust. We identify this last area as being in particular need of further research.