Research on Adaptive Extraction and Integration of Semi-Structured Web Data for Web Data Integration

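Project title: Research on Adaptive Extraction and Integration of Semi-Structured Web Data for Web Data Integration

Project number: No.61303007

Project type: Young Scientists Fund

Year of approval: 2014

Discipline: Automation technology, computer technology

Principal investigator: Ding Yanhui (丁艳辉)

Affiliation: Shandong Normal University

Funding amount: 230,000 CNY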

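Abstract (translated from Chinese): As large volumes of heterogeneous, valuable data emerge on the Internet, Web data integration consolidates data from multiple sources and provides essential data support for analytical applications such as market intelligence analysis, public opinion analysis, and business intelligence. Semi-structured data is an important component of Web data, and the extraction and integration of semi-structured Web data is a key stage of Web data integration that poses many difficulties and challenges. This project intends to study the extraction and integration of semi-structured Web data in depth, in the context of Web data integration. First, addressing the dynamic nature of Web data integration, it will study adaptive extraction techniques for semi-structured Web data, making full use of the data already integrated in the Web data integration system and of the long-distance dependencies between Web data elements, in order to extract adaptively from new data sources in the same domain. Second, it will study techniques that combine duplicate record detection with conflict resolution, making full use of the guidance offered by the already integrated data and by domain knowledge, as well as the mutual reinforcement between duplicate record detection and conflict resolution. The goal is a combined method for duplicate record detection and conflict resolution that suits the characteristics of Web data and improves the accuracy and adaptability of Web data integration methods.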

Keywords (translated from Chinese): Web data integration; semi-structured Web data; data extraction; data integration

English abstract: With huge volumes of heterogeneous and valuable data emerging on the Internet, Web data integration consolidates multiple Web data sources and provides important data support for applications such as business intelligence and market intelligence. Semi-structured data is an important part of Web data. The extraction and integration of semi-structured Web data is a crucial step in Web data integration and faces many difficulties and challenges. This project intends to conduct in-depth research on the extraction and integration of semi-structured Web data. Because Web data integration is a dynamic process, the project studies adaptive extraction methods. The data already accumulated in the Web data integration system and the long-distance dependencies between Web elements are used to extract adaptively from new data sources in the same domain. The project also studies techniques that handle duplicate record detection and conflict resolution simultaneously. The accumulated data in the Web data integration system and the domain knowledge, together with the relationship between duplicate record detection and conflict resolution, are used to solve both processes simultaneously, which can improve the accuracy and adaptability of b

English keywords: Web data integration; semi-structured Web data; data extraction; data fusion
