Data quality issues have attracted widespread attention because dirty data degrades the results of data mining and machine learning. Knowing how data quality relates to the accuracy of results could guide two decisions: selecting an algorithm suited to the quality of the available data, and determining what share of the data to clean. However, little research has explored this relationship. Motivated by this, this paper experimentally compares the effects of missing, inconsistent, and conflicting data on classification and clustering algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.