Context: The utility of prediction models in empirical software engineering (ESE) depends heavily on the quality of the data used to build them. Data quality challenges such as noise, incompleteness, outliers, and duplicate data points may all be relevant in this regard.
Objective: We investigate the reporting of three potentially influential elements of data quality in ESE studies: data collection, data pre-processing, and the identification of data quality issues. This enables us to establish how researchers view the topic of data quality and which mechanisms are being used to address it. Greater awareness of data quality should inform both the sound conduct of ESE research and the robust practice of ESE data collection and processing.
Method: We performed a targeted literature review of empirical software engineering studies covering the period January 2007 to September 2012. A total of 221 relevant studies met our inclusion criteria and were characterized in terms of their consideration and treatment of data quality.
Results: We obtained useful insights into how the ESE community considers these three elements of data quality. Only 23 of the 221 studies reported on all three elements of data quality considered in this paper.
Conclusion: Data collection procedures are not reported consistently in ESE studies. It would be useful for data collection challenges to be reported, in order to improve our understanding of why there are problems with software engineering data sets and the models developed from them. More generally, data quality should be given far greater attention by the community. Improving data sets through enhanced data collection, pre-processing, and quality assessment should lead to more reliable prediction models, and thus to better software engineering practice.