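Project title: Research on Web Information Extraction Based on Tree-Structured Pattern Mining
Project number: No.61005044
Project type: Young Scientists Fund project
Approval year: 2011
Project discipline: Light industry; handicrafts
Project author: Wu Gongqing (吴共庆)
Author affiliation: Hefei University of Technology (合肥工业大学)
Project funding: 70,000 yuan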
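Abstract (translated from the Chinese): Centering on Web information extraction based on tree-structured pattern mining, we carried out a systematic study spanning model design, problem formulation and complexity analysis, algorithm design and analysis, and real-world applications, and obtained substantial interim results. We designed a Web table extraction algorithm based on a tree edit distance metric, which effectively improves the precision of Web table extraction relative to a string edit distance metric. We designed a special tree-structured pattern, the distinguishing path pattern, and posed the research question of whether Web news can be extracted accurately using path patterns alone. To address this question, we developed a node-level annotation tool for Web news corpora and completed the corpus annotation; studied the Web information extraction model based on distinguishing path patterns, the pattern discovery problem and its complexity, and algorithms for solving the pattern discovery problem; and conducted empirical studies on real-world data sets. Building on this work, we completed applied research on a personalized Web news filtering and summarization system, and carried out exploratory research on related problems that arise when processing the extracted text: pattern matching with wildcards, user interest modeling, feature dimensionality reduction based on semantic analysis, and high-performance clustering. With support from this project, we have published 1 SCI-indexed paper and 5 international conference papers (3 of which are already EI-indexed), had 1 paper accepted by a Chinese core journal, submitted 1 paper to SIGKDD-2012, and received 1 Best Paper Award at ICTAI-2011, laying a solid foundation for further in-depth research.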
Keywords (translated from the Chinese): data mining; Web information extraction; tree edit distance; tree-structured pattern mining
English abstract: We have carried out a systematic investigation of the Web information extraction problem based on tree structure pattern mining, covering model design, problem description, complexity analysis, algorithm design and analysis, and real-world applications, and have obtained substantial interim results. A Web table extraction algorithm based on a tree edit distance metric has been designed and shown to achieve higher extraction precision than a string edit distance metric. Can we accurately extract Web news using only path patterns? This is the research question posed in this project. To address it, a special tree structure pattern, the distinguishing path pattern, has been designed for Web information extraction, and a node-level annotation tool has been developed and used to annotate Web news corpora. We have studied a Web extraction model based on distinguishing path patterns, the distinguishing path pattern discovery problem and its complexity, and algorithms for discovering such patterns, and have conducted empirical studies on real data sets. Building on this work, we have completed a personalized Web news filtering and summarization system. For processing the extracted text, we have also carried out exploratory studies of pattern matching with wildcards, user interest modeling, dimension reduction based on semantic analysis, and high-performance clustering. Supported by this grant, we have published 1 SCI-indexed journal article and 5 international conference papers (3 of which are EI-indexed); 1 article has been accepted by a Chinese core journal, and 1 article has been submitted to SIGKDD-2012. We also received the Best Paper Award of ICTAI-2011. These results lay a solid foundation for further in-depth research.
English keywords: data mining; Web information extraction; tree edit distance; tree structure pattern mining