Prediction of new outlinks for focused crawling

Discovering new hyperlinks enables Web crawlers to find new pages that have not yet been indexed. This is especially important for focused crawlers because they strive to provide a comprehensive analysis of specific parts of the Web, and therefore prioritize discovering new pages over discovering changes in content. In the literature, changes in hyperlinks and changes in content have usually been considered together. However, there is also evidence suggesting that these two types of change are not necessarily related. Moreover, many studies on predicting changes assume that a long history of a page is available, which is unattainable in practice. The aim of this work is to provide a methodology for detecting new links effectively using a short history. To this end, we use a dataset of ten crawls taken at one-week intervals. Our study consists of three parts. First, we gain insight into the data by analyzing empirical properties of the number of new outlinks. We observe that these properties are, on average, stable over time, but that hyperlinks towards pages within the domain of a target page (internal outlinks) emerge very differently from hyperlinks towards pages outside it (external outlinks). Next, we provide statistical models for three targets: the link change rate, the presence of new links, and the number of new links. These models include features used earlier in the literature as well as new features introduced in this work. We analyze the correlation between the features and investigate their informativeness. A notable finding is that, if the history of the target page is not available, then our new features, which represent the history of related pages, are the most predictive of new links in the target page. Finally, we propose ranking methods that serve as guidelines for focused crawlers to discover new pages efficiently; these methods achieve excellent performance with respect to the corresponding targets.
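As a rough illustration of the quantities named in the abstract, the following sketch derives new internal and external outlinks from consecutive weekly snapshots of a page and computes three simple targets from them. It is not the authors' method. The function names, the use of the URL host as a stand-in for "domain", and the definition of the change rate as the fraction of intervals with at least one new outlink are all assumptions made for this example.

```python
from urllib.parse import urlsplit


def host(url):
    """Return the lowercased host of a URL (assumed stand-in for 'domain')."""
    return (urlsplit(url).hostname or "").lower()


def new_outlinks(page_url, prev_links, curr_links):
    """Split links seen in curr_links but not in prev_links into (internal, external)."""
    added = set(curr_links) - set(prev_links)
    internal = {u for u in added if host(u) == host(page_url)}
    return internal, added - internal


def targets(page_url, snapshots):
    """Compute three illustrative targets from a list of outlink sets, one per crawl.

    Assumption: the change rate is the fraction of crawl intervals in which
    at least one new outlink appeared; the other two targets refer to the
    most recent interval.
    """
    counts = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        internal, external = new_outlinks(page_url, prev, curr)
        counts.append(len(internal) + len(external))
    intervals = len(counts)
    return {
        "change_rate": sum(1 for c in counts if c > 0) / intervals if intervals else 0.0,
        "has_new_links": bool(counts) and counts[-1] > 0,
        "num_new_links": counts[-1] if counts else 0,
    }
```

With ten weekly crawls, `snapshots` would hold ten outlink sets and yield nine intervals. A crawler with only a short history could compute the same quantities for related pages and use them as features, in the spirit of the finding reported above.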


Related content

Web crawler


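A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. It is widely used across the Internet. Search engines use web crawlers to fetch web pages, documents, and even images, audio, video, and other resources, organize this information with indexing techniques, and make it available for users to query. Web crawlers also offer small and medium-sized sites an effective route to promotion.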