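Project title: Research on Distance-Based Data Constraint Rules
Project number: No.61202008
Project type: Young Scientists Fund project
Approval year: 2013
Discipline: Computer Science
Project author: Song Shaoxu (宋韶旭)
Affiliation: Tsinghua University (清华大学)
Funding: 250,000 CNY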
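Abstract (translated from Chinese): As the demand for data quality grows increasingly urgent, distance-based data constraint rules play an important role in data quality applications such as conflict detection, consistency analysis, and data repair. This project studies mechanisms for automatically mining distance constraint rules and explores practical methods for applying them to data repair. For the mining problem, it proposes a parameter-free method for determining distance thresholds and designs performance optimization techniques for the threshold computation algorithm. Studying mining methods for distance constraint rules provides a theoretical basis and a technical foundation for applications in the data quality field. The project focuses in particular on the practical application of distance constraint rules to data repair. Through theoretical analysis, it examines the complexity and technical difficulties of the data repair problem under distance constraint rules, and proposes an effective approximate repair method based on safe contraction. The results will be validated experimentally. Automatic mining of distance constraint rules, together with the data repair techniques, will improve the quality and trustworthiness of data and promote the deployment and development of trusted software in China.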
Keywords (translated from Chinese): data constraint rules; distance constraint rules; data dependencies; data repair; data quality
English abstract: As data quality becomes a key issue in practice, metric distance constraints are often deployed to improve the quality of data, for tasks such as detecting violations, analyzing consistency, and repairing dirty data. In this proposal, we focus on the automatic discovery of metric distance constraints, as well as their application to the important problem of data repairing. First, to find metric distance constraints automatically, we propose parameter-free mining of distance thresholds. Advanced pruning techniques are also designed to optimize the discovery process. Once the metric distance constraints are obtained by mining, we can investigate the foundations and techniques for applying them to data quality problems. In particular, we study the application of metric distance constraints in data repairing. The complexity and hardness of the repairing problem are first analyzed with theoretical proofs. Recognizing this hardness, we then develop a safe-contraction-based algorithm for approximate repairing. All the proposed approaches are evaluated through extensive experiments. To the best of our knowledge, this is the first work on mining and repairing with respect to metric distance constraints. We believe that our proposal can improve the quality and reliability of data, and contribute
English keywords: Data Constraints; Metric Distance Constraints; Data Dependencies; Data Repairing; Data Quality