A web crawler is a system that collects web pages, and crawling new pages efficiently requires appropriate algorithms. Website features such as XML sitemaps and the frequency of past page updates provide important clues for finding new pages, but they are difficult to apply uniformly across diverse conditions. In this study, we propose a method that uses a large language model (LLM) to classify web pages into two types, "Index Pages" and "Content Pages," and then selects the index pages as starting points for reaching new pages. We construct a dataset with automatically annotated web page types and evaluate our approach from two perspectives: page type classification performance and coverage of new pages. Experimental results show that the LLM-based method outperforms the baseline methods on both evaluation metrics.
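To make the idea concrete, the following is a minimal sketch of the pipeline described above, not the implementation evaluated in this study. It assumes a toy in-memory site, a hypothetical `toy_classifier` that stands in for the LLM call (it labels pages by outlink count only), and a simplified coverage measure defined as the fraction of new pages linked directly from the selected index pages. All names and data in it are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class Page:
    url: str
    outlinks: List[str] = field(default_factory=list)


def toy_classifier(page: Page) -> str:
    """Hypothetical stand-in for the LLM classifier (assumption):
    a page with three or more outlinks is labeled an index page."""
    return "index" if len(page.outlinks) >= 3 else "content"


def select_seeds(pages: Iterable[Page], classify: Callable[[Page], str]) -> List[str]:
    """Keep only the pages classified as index pages as crawl starting points."""
    return [p.url for p in pages if classify(p) == "index"]


def new_page_coverage(
    seeds: List[str],
    site: Dict[str, Page],
    known: Set[str],
    new_pages: Set[str],
) -> float:
    """Simplified coverage (assumption): share of new pages that are
    linked directly from at least one seed page."""
    reached: Set[str] = set()
    for url in seeds:
        reached.update(site[url].outlinks)
    found = (reached - known) & new_pages
    return len(found) / len(new_pages) if new_pages else 0.0


# Toy site: "/news" links to one known article and two newly published ones.
site = {
    "/news": Page("/news", ["/news/1", "/news/2", "/news/3"]),
    "/news/1": Page("/news/1", ["/news"]),
    "/about": Page("/about", ["/news"]),
}
known = set(site)                               # pages already collected
new_pages = {"/news/2", "/news/3", "/blog/9"}   # pages published since then

seeds = select_seeds(site.values(), toy_classifier)
coverage = new_page_coverage(seeds, site, known, new_pages)
print(seeds, round(coverage, 3))
```

In this toy setting only `/news` is selected as a starting point, and it reaches two of the three new pages. Replacing `toy_classifier` with an actual LLM call leaves the seed selection and coverage steps unchanged, which is the reason the classifier is passed in as a parameter.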