Research on Key Technologies for a General-Purpose Web Structured Information Retrieval Engine

Project title: Research on Key Technologies for a General-Purpose Web Structured Information Retrieval Engine

Project number: No.61202331

Project type: Young Scientists Fund project

Year of approval: 2013

Discipline: Computer Science

Project author: 王秋月 (Wang Qiuyue)

Author affiliation: Renmin University of China (中国人民大学)

Project funding: 250,000 yuan

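Chinese abstract (translated): A large amount of structured data exists on the Web, most of it residing in Deep Web databases that general search engines can hardly retrieve. How to use this data to satisfy users' information needs has long been a research problem. The early data integration approach is better suited to domain-specific vertical search engines. The data crawling approach adopted by general-purpose search engines in recent years pre-computes as many relevant pages as possible, but because these pages are indexed and retrieved like ordinary HTML pages, the structural information originally present in the data cannot be used to improve retrieval effectiveness. To address the main shortcomings of both approaches, this project aims to develop a cross-domain, general-purpose retrieval engine for structured information on the Web, one that fully exploits structural information during retrieval while remaining cross-domain and general-purpose. Specifically, we combine the strengths of the data integration and data crawling approaches and use a new language-model-based structured information retrieval model that fully exploits the structural information in both data and queries to improve retrieval effectiveness. The model also unifies the retrieval of structured and unstructured data in a single system, making it possible for a general-purpose search engine to instantly search all kinds of structured and unstructured data on the Web.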

Chinese keywords (translated): structured Web data; Deep Web; federated search; data source selection; structured keyword query

English abstract: The Web contains an increasing amount of structured data. Most of it is buried in the Deep Web and can hardly be indexed or searched by search engines. How to make use of such data to satisfy users' information needs has attracted many researchers' attention since the early years of this century. There are two fundamentally different approaches to incorporating the Deep Web into search engines: data integration, and deep web crawling (or surfacing). The data integration approach is more suitable for building vertical search engines in a specific domain. The deep web surfacing approach, which is adopted by current general-purpose search engines such as Google, pre-computes as many dynamic pages as possible from the underlying structured database, then indexes and searches these pages together with the other static HTML pages on the Web. However, the original structural information in the data is thus ignored during search. In this project, we aim to build a general-purpose search engine for structured data on the Web that can exploit the structural information in the data to improve search performance, and that is general-purpose in the sense that it can search all structured data on the Web across various domains. In particular, we take advantage of both the data integration and deep web surfacing approaches. We employ our newly studied

English keywords: structured data on the Web; Deep Web; federated search; data source selection; structured keyword query
