Research on Key Issues in Web Information Extraction Based on Tree-Structured Patterns

Project title: Research on Key Issues in Web Information Extraction Based on Tree-Structured Patterns

Project number: No.61273297

Project type: General Program (面上项目)

Year of approval: 2013

Discipline: Automation technology, computer technology

Project author: 吴共庆 (Wu Gongqing)

Affiliation: 合肥工业大学 (Hefei University of Technology)

Funding: 800,000 yuan (80万元)

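Chinese abstract: Web data is massive, dynamic, and heterogeneous, which challenges traditional information extraction models and algorithms in accuracy, degree of automation, generality, and time and space performance. Because structural patterns in a Web page's parse tree are insensitive to the language of the content and are easy to locate, evolvable, and transferable, this project focuses on key issues in Web information extraction using tree-structured patterns. Through in-depth analysis of the characteristics of Web data sources, it studies representation models for tree-structured patterns suited to Web information extraction. It investigates the discovery of tree-structured patterns with strong discriminating and boundary-delimiting ability, seeks fast and effective methods for mining extraction pattern trees, and studies change detection methods, together with mechanisms and methods for evolving extraction pattern trees, in environments where page structure changes dynamically. In addition, to raise the degree of automation in obtaining extraction patterns for new, unlabeled Web data sources, it studies the transferability of pattern trees and the mechanisms and methods of knowledge transfer. Building on this work and targeting Web service applications, a prototype Web information extraction system based on tree-structured pattern mining will be constructed. Real-world news pages and Web table data in Chinese, English, Japanese, and other languages will serve as data sources to test the soundness and feasibility of the proposed theories and methods.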

Chinese keywords: Web information extraction; tree pattern mining; tree structure features; online extraction; Web big data

English abstract: Web pages are massive, dynamic, and heterogeneous, and these characteristics challenge traditional information extraction models and algorithms in accuracy, degree of automation, versatility, and space/time performance. Because tree patterns in a Web page's parse tree are not sensitive to the language of the Web content, they have the advantages of being easy to locate and able to evolve and transfer. This project addresses key issues of Web information extraction using tree patterns. Through in-depth analysis of extensive cases of Web data sources, tree patterns suitable for extracting Web information will be studied. How to mine tree patterns with a strong distinguishing ability will be investigated. On these foundations, efficient and effective extraction methods will be designed, and change detection and evolution mechanisms will then be proposed for the extracted pattern tree in a dynamically changing environment. In addition, transfer mechanisms and algorithms for tree patterns will be developed to improve the degree of automation in acquiring patterns from unlabeled Web data sources. Along with the above research issues, a Web information extraction prototype system based on tree pattern mining will be implemented for Web service applications, to demonstrate the soundness an

English keywords: Web Information Extraction; Tree Pattern Mining; Tree Structure Feature; Online Extraction; Web Big Data
