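Project title: Research on Key Technologies for Crowdsourcing-Based Data Cleaning
Project number: No.61472198
Project type: General Program
Year of approval: 2015
Discipline: Automation technology, computer technology
Project author: Feng Jianhua (冯建华)
Affiliation: Tsinghua University
Funding: RMB 860,000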
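Abstract (translated from Chinese): In today's era of rapid informatization, data plays an increasingly important role in every industry; for example, data analysis often helps enterprises make sound business decisions in the market. If the data is dirty, however, analysis based on it can lead to completely wrong business decisions and cause heavy losses. According to a recent Experian survey, British businesses lost as much as 8 billion pounds in total in 2011 because of dirty data. Machine-based data cleaning techniques have received wide attention as a way to clean such data, but current methods still fall short of satisfactory results. In recent years, crowdsourcing has attracted wide attention in both industry and academia and has been shown to achieve better results than sophisticated machine algorithms. Inspired by this, the project studies crowdsourcing-based data cleaning. The specific research topics are: (1) crowdsourced data error detection; (2) crowdsourced data error repair; (3) crowdsourced deduplication of redundant data; (4) quality control of crowdsourced data cleaning results. In addition, we will integrate these research results into a crowdsourced data cleaning system that produces better results than current mainstream data cleaning systems.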
Keywords (translated from Chinese): Crowdsourcing;Data Cleaning;Quality Control;Data Repair;Data Redundancy
English abstract: With the rapid development of information technology, data plays an increasingly important role in every industry. For example, data analysis can help enterprises make better decisions in the market. However, if the data is not clean, analysis based on it may lead to completely wrong decisions and cause enormous losses to enterprises. According to a recent study from Experian QAS Inc., poor customer data cost British businesses £8 billion in lost revenue in 2011. Machine-based data cleaning approaches have been widely studied for several decades as a way to clean dirty data, but they remain far from perfect. Recently, crowdsourcing has attracted significant attention in both the industrial and academic communities, and it has been widely shown to obtain better results than sophisticated machine-based approaches. This insight motivates us to explore crowdsourced data cleaning approaches. In particular, this proposal studies the following four problems: (1) crowdsourced data error detection; (2) crowdsourced dirty data repair; (3) crowdsourced duplicate data detection; (4) quality control of crowdsourced data cleaning results. In addition, we will combine these research results to develop a real crowdsourced data cleaning system that aims to outperform state-of-the-art machine-based data cleaning systems in result accuracy.
English keywords: Crowdsourcing;Data Cleaning;Quality Control;Data Repair;Data Redundancy