Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges concerning not only the availability of such data but also its quality. For example, incomplete, erroneous, or inappropriate training data can lead to unreliable models that ultimately produce poor decisions. Trustworthy AI applications require training and test data of high quality along many dimensions, such as accuracy, completeness, consistency, and uniformity. We empirically explore the correlation between six of the traditional data quality dimensions and the performance of fifteen widely used machine learning (ML) algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining ML results in terms of data quality. Our experiments distinguish three scenarios based on which steps of the AI pipeline are fed polluted data: polluted training data, polluted test data, or both. We conclude the paper with an extensive discussion of our observations and recommendations, alongside open questions and future directions.
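The three pollution scenarios named in the abstract can be illustrated with a small sketch. This is not the authors' experimental setup: the data are synthetic, a nearest-centroid classifier stands in for the fifteen ML algorithms, and only one dimension (completeness) is polluted, by overwriting feature values with a placeholder. All names here (`pollute_completeness`, `fit_centroids`, `run_scenarios`, and so on) are hypothetical and chosen for illustration only.

```python
import random


def pollute_completeness(rows, fraction, rng, placeholder=0.0):
    """Return a copy of rows in which each feature value is replaced by
    `placeholder` with probability `fraction` (simulated missing values)."""
    polluted = []
    for features, label in rows:
        new = [placeholder if rng.random() < fraction else v for v in features]
        polluted.append((new, label))
    return polluted


def fit_centroids(rows):
    """Nearest-centroid 'training': compute the mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], features)),
    )


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, f) == y for f, y in rows)
    return hits / len(rows)


def make_data(n, rng):
    """Synthetic two-class data: Gaussian blobs centred at (-3, -3) and (3, 3)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 3.0 if label else -3.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows


def run_scenarios(fraction=0.3, seed=0):
    """Evaluate a clean baseline plus the three scenarios:
    polluted training data, polluted test data, or both."""
    rng = random.Random(seed)
    train, test = make_data(200, rng), make_data(100, rng)
    bad_train = pollute_completeness(train, fraction, rng)
    bad_test = pollute_completeness(test, fraction, rng)
    return {
        "clean": accuracy(fit_centroids(train), test),
        "polluted_train": accuracy(fit_centroids(bad_train), test),
        "polluted_test": accuracy(fit_centroids(train), bad_test),
        "polluted_both": accuracy(fit_centroids(bad_train), bad_test),
    }


if __name__ == "__main__":
    for scenario, acc in run_scenarios().items():
        print(f"{scenario}: {acc:.3f}")
```

Comparing each scenario against the clean baseline is one simple way to attribute a change in model performance to where in the pipeline the polluted data entered. Any resemblance between these toy numbers and the paper's findings should not be assumed.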