Machine learning plays a key role in many applications, such as smart homes, smart medical assistance, and autonomous driving. A major challenge in these applications is preserving the quality of the training and serving data. However, existing data cleaning methods cannot exploit context information, so they usually fail to track shifts in the data distributions or in the associated error profiles. To overcome these limitations, we introduce a novel method for automated tabular data cleaning, powered by dynamic functional dependency rules extracted from a live context model. As a proof of concept, we create a smart home use case in which data is collected while its context information is preserved. Evaluations on two different data sets show that the proposed cleaning method outperforms a set of baseline methods in terms of detection and repair accuracy.
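To make the idea of cleaning with functional dependency rules concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a rule `lhs -> rhs` is violated when rows that agree on `lhs` disagree on `rhs`, and it repairs a violation with the majority value of the group, which is a simple stand-in for a real repair strategy. The function names, the `schedule_active` context flag, and the smart home columns (`device`, `room`, `mode`, `setpoint`) are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def detect_fd_violations(rows, lhs, rhs):
    """Return {row_index: suggested_value} for rows that violate lhs -> rhs."""
    # Group row indices by their values on the left-hand-side attributes.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)

    repairs = {}
    for indices in groups.values():
        counts = Counter(rows[i][rhs] for i in indices)
        if len(counts) < 2:
            continue  # the group agrees on rhs, so the rule holds here
        (top, top_n), (_, second_n) = counts.most_common(2)
        if top_n == second_n:
            continue  # tie: no clear majority, so leave the rows untouched
        for i in indices:
            if rows[i][rhs] != top:
                repairs[i] = top
    return repairs


def rules_for_context(context):
    """Hypothetical stand-in for rules extracted from a live context model."""
    rules = [(("device",), "room")]  # assumed to hold in every context
    if context.get("schedule_active"):
        # Assumed to hold only while a heating schedule is active.
        rules.append((("room", "mode"), "setpoint"))
    return rules


def clean(rows, rules):
    """Apply each rule in order and return a repaired copy of the rows."""
    cleaned = [dict(r) for r in rows]
    for lhs, rhs in rules:
        for i, value in detect_fd_violations(cleaned, lhs, rhs).items():
            cleaned[i][rhs] = value
    return cleaned


readings = [
    {"device": "t1", "room": "kitchen", "mode": "eco", "setpoint": 18},
    {"device": "t1", "room": "kitchen", "mode": "eco", "setpoint": 18},
    {"device": "t1", "room": "hallway", "mode": "eco", "setpoint": 18},
    {"device": "t1", "room": "kitchen", "mode": "eco", "setpoint": 25},
    {"device": "t2", "room": "bedroom", "mode": "eco", "setpoint": 17},
]

cleaned = clean(readings, rules_for_context({"schedule_active": True}))
```

In this sketch the set of active rules depends on the context passed to `rules_for_context`, so the same table can be cleaned differently as the context changes. With the schedule flag off, only the `device -> room` rule is applied and the outlying setpoint is left alone.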