Existing techniques for efficiently crawling social media sites rely on URL patterns, query logs, and human supervision. This paper describes SOUrCe, a structure-oriented unsupervised crawler that uses page structure to learn how to crawl a social media site efficiently. SOUrCe operates in two phases. During its unsupervised learning phase, SOUrCe constructs a sitemap that clusters pages by structural similarity and generates a navigation table that describes how the different types of pages in the site link to one another. During its harvesting phase, it uses the navigation table and a crawling policy to choose which links to crawl next. Experiments show that this architecture supports different styles of crawling efficiently and stays more focused on user-created content than baseline methods.
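As a rough illustration of the two phases, the following Python sketch clusters the pages of a toy in-memory site by the Jaccard similarity of their tag-path sets, builds a navigation table from the observed links, and then crawls with that table. Everything in it is an assumption made for illustration and is not taken from the paper: the greedy clustering, the 0.5 threshold, the idea of a link "slot" (a label for where a link sits on a page), the function names, and the `SITE` data.

```python
from collections import Counter, defaultdict, deque

def jaccard(a, b):
    """Jaccard similarity of two sets of tag paths."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cluster_pages(signatures, threshold=0.5):
    """Learning phase, step 1 (hypothetical): greedy structural clustering.
    A page joins the first cluster whose representative is similar enough,
    otherwise it starts a new cluster."""
    reps, labels = [], {}
    for url, sig in signatures.items():
        for cid, rep in enumerate(reps):
            if jaccard(sig, rep) >= threshold:
                labels[url] = cid
                break
        else:
            labels[url] = len(reps)
            reps.append(sig)
    return labels, reps

def build_navigation_table(outlinks, labels):
    """Learning phase, step 2 (hypothetical): for each (source cluster, slot)
    pair, record the destination cluster seen most often."""
    votes = defaultdict(Counter)
    for src, links in outlinks.items():
        if src not in labels:
            continue
        for slot, dst in links:
            if dst in labels:
                votes[(labels[src], slot)][labels[dst]] += 1
    return {key: c.most_common(1)[0][0] for key, c in votes.items()}

def classify(sig, reps, threshold=0.5):
    """Return the id of the best-matching cluster, or None if none is close."""
    best = max(range(len(reps)), key=lambda i: jaccard(sig, reps[i]), default=None)
    if best is None or jaccard(sig, reps[best]) < threshold:
        return None
    return best

def harvest(seed, fetch, reps, nav, wanted, budget=10):
    """Harvesting phase (hypothetical policy): breadth-first crawl that follows
    a link only if the navigation table predicts a wanted destination cluster."""
    queue, seen, visited = deque([seed]), {seed}, []
    while queue and len(visited) < budget:
        url = queue.popleft()
        sig, links = fetch(url)
        visited.append(url)
        cid = classify(sig, reps)
        for slot, dst in links:
            if dst not in seen and nav.get((cid, slot)) in wanted:
                seen.add(dst)
                queue.append(dst)
    return visited

# Toy site (made up): url -> (set of tag paths, list of (slot, target url)).
SITE = {
    "/": ({"html/body/ul/li/a"}, [("nav", "/forum"), ("footer", "/about")]),
    "/forum": ({"html/body/table/tr/td/a", "html/body/div/form"},
               [("row", "/t/1"), ("row", "/t/2"), ("footer", "/about")]),
    "/t/1": ({"html/body/div/p", "html/body/div/span"}, [("footer", "/about")]),
    "/t/2": ({"html/body/div/p", "html/body/div/span", "html/body/div/img"},
             [("footer", "/about")]),
    "/about": ({"html/body/h1"}, []),
}

signatures = {url: sig for url, (sig, _) in SITE.items()}
outlinks = {url: links for url, (_, links) in SITE.items()}

labels, reps = cluster_pages(signatures)
nav = build_navigation_table(outlinks, labels)

# Crawl policy: keep only the forum-index and thread clusters.
wanted = {labels["/forum"], labels["/t/1"]}
visited = harvest("/", SITE.__getitem__, reps, nav, wanted)
print(visited)
```

In this toy run the two thread pages fall into one cluster, and the crawl never fetches `/about`, because the table predicts that "footer" links lead to a cluster the policy does not want. A real crawler would fetch pages over the network and would need a far more robust structural signature than a flat set of tag paths.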