In data integration, data quality problems are frequently encountered when extracting, combining, and merging data. The probabilistic data integration approach represents such problems as uncertainties in a probabilistic database. In this paper, we propose a data-cleaning autoencoder capable of near-automatic data quality improvement. It learns the structure and dependencies in the data to identify and correct doubtful values. We provide a theoretical framework, and experiments show that the approach removes significant amounts of noise from both categorical and numeric probabilistic data. Our method does not require clean data; we do, however, show that manually cleaning a small fraction of the data significantly improves performance.
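To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of a data-cleaning autoencoder on numeric data: the network learns the dependencies in the records, cells with large reconstruction error are flagged as doubtful, and flagged cells are replaced by their reconstructions. The class name, the `clean` helper, and the error threshold are illustrative assumptions.

```python
# Minimal sketch, assuming a numeric table and a simple reconstruction-error
# criterion for "doubtful" cells. Names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class DataCleaningAE(nn.Module):
    def __init__(self, n_features: int, hidden: int = 8):
        super().__init__()
        # Bottleneck forces the model to capture dependencies between columns.
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def clean(model, x, threshold=0.5):
    """Flag cells whose absolute reconstruction error exceeds `threshold`
    (a hypothetical cut-off) and replace them with the reconstruction."""
    with torch.no_grad():
        recon = model(x)
        doubtful = (x - recon).abs() > threshold
        return torch.where(doubtful, recon, x), doubtful

if __name__ == "__main__":
    torch.manual_seed(0)
    # Synthetic data with one dependency: column 1 is roughly 2 * column 0.
    base = torch.rand(256, 1)
    data = torch.cat([base, 2 * base], dim=1)
    noisy = data.clone()
    noisy[::16, 1] += 3.0  # inject occasional corrupted values

    model = DataCleaningAE(n_features=2)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    # Train on the noisy data itself; no clean data is required, mirroring
    # the setting described in the abstract.
    for _ in range(500):
        opt.zero_grad()
        loss = loss_fn(model(noisy), noisy)
        loss.backward()
        opt.step()

    cleaned, flags = clean(model, noisy)
    print("flagged cells:", int(flags.sum()))
```

Categorical attributes would additionally need an encoding step (e.g. one-hot columns with a softmax reconstruction), which this sketch omits for brevity.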