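Project title: Research on Semantic Cooperation and Competition Strategies for Web Crawlers Based on Concept Context Graphs
Project number: No.61271413
Project type: General Program
Year of approval: 2013
Discipline: Radio Electronics and Telecommunications Technology
Project author: Du Yajun (杜亚军)
Affiliation: Xihua University (西华大学)
Funding: 700,000 CNY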
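Chinese abstract (translated): When multiple focused Web crawlers crawl in parallel, how to avoid revisiting the same pages while efficiently retrieving topic-relevant pages has become an active research question in focused crawling for search engines. To complete the system's crawling task and make full use of each crawler's own capabilities, this project starts from the idea that crawlers crawl relatively independently, cooperate with one another, and compete with one another. It treats the pages each crawler has visited as that crawler's background knowledge, analyzes the text of those pages, extracts their concepts and the semantic relations among the concepts, and examines the semantic similarity between the background knowledge of different crawlers. On this basis it proposes methods for mutual understanding among crawlers, together with cooperation and competition strategies, based on hierarchical concept context graphs. The research focuses on four topics: 1) a hierarchical concept context graph model for representing a focused crawler's background knowledge; 2) a semantic understanding method for crawlers based on the hierarchical concept context graph; 3) cooperation and competition mechanisms, and their implementation, among multiple crawlers in the same group under the semantic understanding model; 4) cooperation and competition mechanisms, and their implementation, among multiple crawlers in different groups under the semantic understanding model. The research is expected to yield a set of new ideas, methods, techniques, and systems for information acquisition in which multiple Web crawlers understand, cooperate with, and compete with one another. The project therefore has significant theoretical value and broad application prospects.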
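Chinese keywords (translated): Multi-Agent System;Focused Web Crawler;Concept Context Graph;Cooperation and Competition;Information Retrieval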
English abstract: In a focused crawling system, multiple crawlers crawl the Web in parallel and download Web pages. How different focused crawlers can avoid visiting the same URLs while efficiently downloading Web pages related to the search topic is one of the active research topics in search engines. To accomplish the system's crawling tasks for a specific topic rapidly and to make full use of every Web crawler's ability, we consider that the Web pages (URLs) each focused crawler has visited reflect its background knowledge. On the basis of the system's Web crawlers crawling independently, collaborating together, and competing with each other, we propose novel understanding, cooperation, and competition strategies based on the concept context graph. We do so by analyzing the content of these Web pages, extracting their semantic features (concepts) from each crawler's history collection as its background knowledge, and studying the semantic relationships among the crawlers' background knowledge. Our main research topics are as follows: 1) Constructing a mathematical model of each Web crawler's background knowledge based on a hierarchical concept context graph, according to the semantic characteristics (concepts) of Web pages and the semantic relationships among the concepts. 2) Studying the understanding method and model among Web crawler
English keywords: Multi-Agent System;Focused Web Crawler;Concept Context Graph;Cooperating and Competing;Information Retrieval