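Project title: Research on Key Technologies for Error Detection and Repair in Big Data
Project number: No.61472099
Project type: General Program
Year of approval: 2015
Discipline: Automation technology, computer technology
Project author: Wang Hongzhi (王宏志)
Author affiliation: Harbin Institute of Technology (哈尔滨工业大学)
Funding: CNY 820,000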
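Abstract (translated from Chinese): Big data is pervasive in current applications and has become one of the hot topics in data management research. Because of its volume, velocity, and variety, big data is more likely to contain errors, that is, data that is inconsistent, outdated, incomplete, or imprecise, or conflicts among data describing the same entity (referred to here as entity inconsistency). Whether errors can be detected and repaired effectively is a key factor in the success or failure of data-centric systems. However, because of insufficient scalability, missing support for multiple error categories, and a lack of knowledge, current error detection and repair techniques are difficult to apply to big data. Building on the research group's existing work, this project therefore studies key techniques for detecting and repairing errors in big data. The project plans to propose a computationally efficient data quality model for big data; to propose detection and repair algorithms suited to big data for each of five classes of data error (inconsistency, outdatedness, incompleteness, imprecision, and entity inconsistency); to propose methods for detecting and repairing mixed errors of multiple types in big data; and to develop a big data error detection and repair system that verifies the correctness and effectiveness of the research results.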
Keywords (translated from Chinese): database; big data; data quality; data management; data cleaning
English abstract: Big data is present in many applications, and big data management has become one of the hot topics in the data management field. Because of its volume, velocity, and variety, big data is more likely to contain errors. Here, an error means inconsistent, outdated, incomplete, or inaccurate data, or a conflict among data referring to the same entity (a conflict, for short). Detecting and repairing errors effectively is essential for data-centric systems. However, existing error detection and repair technologies are difficult to apply to big data because of low scalability, missing support for multiple error types, and a lack of knowledge. Therefore, this project studies key technologies for error detection and repair in big data, building on our existing work. The project will design a computationally efficient data quality model for big data; present algorithms to detect and repair inconsistency, outdatedness, incompleteness, inaccuracy, and conflicts in big data, respectively; propose detection and repair methods for mixed errors of multiple types in big data; and develop an error detection and repair system for big data to verify the correctness and effectiveness of the proposed theories and techniques.
English keywords: database; big data; data quality; data management; data cleaning