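Project title: Non-cooperative Semi-structured Deep Web Data Source Selection for Entity Information Integration
Project number: No.61462037
Project type: Regional Science Fund Project
Approval year: 2015
Discipline: Automation technology, computer technology
Principal investigator: Deng Song (邓松)
Institution: Jiangxi University of Finance and Economics
Funding: 440,000 yuan (CNY)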
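Chinese abstract: On the Web, most deep web sources are semi-structured and their content cannot be retrieved by conventional crawlers, so this project takes non-cooperative semi-structured deep web data sources as its research object. Because entity information integration is an important part of deep web integration research, the project studies source selection methods for semi-structured deep web sources, with the goals of meeting users' needs for integrated entity retrieval and improving the efficiency of deep web integration. The research covers deep web sampling, summary construction, summary updating, and selection strategies. To address problems such as the weak representativeness of samples drawn by existing deep web sampling methods, and in view of the characteristics of entity data in the semi-structured deep web, the project proposes a semi-structured deep web sampling technique based on graphs and random walks. For the integrated retrieval needs of entity information integration and entity association mining, it proposes a summary construction method based on document and feature-word annotation and, in line with the characteristics of these needs, two data source selection methods: one based on interest and novelty, and one based on the transfer relationships of entity features. To reduce the workload of updating data source summaries, it proposes a partial summary update strategy based on sampling and entity expansion, drawing on how data sources in the same domain update their data. The research helps improve the efficiency of semi-structured deep web integration and has good application prospects.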
Chinese keywords: Entity; Non-cooperative; Semi-structured; Deep Web; Data source selection
English abstract: On the Web, most deep web sources are semi-structured, and conventional web crawlers cannot retrieve their content. This project therefore takes non-cooperative semi-structured deep web sources as its research object. Because entity information integration is an important part of deep web information integration, our goals are to meet user demand for integrated entity retrieval and to improve the efficiency of deep web integration, so we study data source selection methods for the semi-structured deep web. The research covers deep web sampling, summary construction, summary updating, and selection strategies. Samples drawn by existing methods are not sufficiently representative. Taking the characteristics of entity data in the semi-structured deep web into consideration, we therefore propose a sampling technique for the semi-structured deep web based on graphs and random walks. For the demands of integrated retrieval, including entity information integration and entity association mining, we propose a summary construction method based on document and feature-word annotation. Taking the characteristics of these demands into account, we propose two data source selection methods: the first is based on interest and novelty, and the second is based on the transfer relationships of entity features. To reduce the workload of updating data source summaries, we take the update characteristics of data sources in the same domain into account and propose a partial summary update strategy based on sampling and entity expansion. The study will help improve the efficiency of semi-structured deep web integration and has good application prospects.
English keywords: Entity; Non-cooperative; Semi-structured; Deep Web; Data Source Selection