With the rapid development of internet technology, dirty data are commonly observed in real-world scenarios, e.g., owing to unreliable sensor readings or to transmission and collection from heterogeneous sources. To mitigate their negative effects on downstream applications, data cleaning approaches preprocess the dirty data before those applications run. Most data cleaning methods identify or correct dirty data by referring to the values of their neighbors, i.e., tuples that share the same information. Unfortunately, owing to data sparsity and heterogeneity, the number of neighbors under the equality relationship is rather limited, especially when data values exhibit small variations. To tackle this problem, distance-based data cleaning approaches consider similarity neighbors defined by value distance. By tolerating small variations, a richer set of similarity neighbors can be identified and used for data cleaning tasks. Moreover, the distance relationship between tuples carries more information than the equality relationship, which it subsumes, and is therefore also helpful for guiding data cleaning. Distance-based techniques thus play an important role in data cleaning, and we expect them to attract more attention in future data preprocessing research. This survey classifies the work into four main data cleaning tasks, i.e., rule profiling, error detection, data repair and data imputation, and comprehensively reviews the state of the art for each class.
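The contrast between equality neighbors and similarity neighbors can be illustrated with a minimal sketch. It is not taken from any method covered by the survey: the function names, the absolute-difference distance, the threshold `eps` and the sample readings are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def equality_neighbors(values: Sequence[float], i: int) -> List[int]:
    """Indices of the other tuples whose value equals values[i] exactly."""
    return [j for j, v in enumerate(values) if j != i and v == values[i]]


def similarity_neighbors(values: Sequence[float], i: int, eps: float) -> List[int]:
    """Indices of the other tuples whose value is within distance eps of values[i].

    Assumption: distance is the absolute difference of numeric values.
    """
    return [j for j, v in enumerate(values) if j != i and abs(v - values[i]) <= eps]


# Hypothetical sensor readings; the last value plays the role of a dirty reading.
readings = [20.1, 20.3, 19.9, 20.1, 35.0]

eq = equality_neighbors(readings, 0)             # exact matches only
sim = similarity_neighbors(readings, 0, eps=0.5)  # tolerates small variations
suspect = similarity_neighbors(readings, 4, eps=0.5)  # neighbors of the last reading

print(eq, sim, suspect)
```

In this toy setting the first reading has a single equality neighbor but three similarity neighbors, while the last reading has none. A tuple with no similarity neighbors is the kind of signal a distance-based detector might use, although real methods use richer distance functions and constraints than this sketch.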