Template detection and content extraction are two of the main areas of information retrieval applied to the Web. Both analyse the structure and content of webpages in order to extract some part of the document, but their objectives differ. Template detection identifies the template of a webpage (usually by comparing it with other webpages of the same website), whereas content extraction identifies the main content of the webpage and discards the rest. The two tasks are therefore complementary to some extent, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of the data on the Web. Identifying templates is thus essential for indexing tasks, because templates usually contain irrelevant information such as advertisements, menus and banners, and processing and storing this information is likely to waste resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite for testing different approaches to template detection and content extraction. The suite is public, and it contains real, heterogeneous webpages that have been labelled so that different techniques can be suitably (and automatically) compared.
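As a rough illustration of what automatic comparison against labelled pages could look like, the following minimal sketch scores the output of a hypothetical extractor against a page's labels using precision, recall and F1. The representation of a page as a set of node identifiers, the function name `score_extraction` and the example data are assumptions made for illustration only; they do not describe the suite's actual format or evaluation procedure.

```python
from typing import Dict, Set


def score_extraction(predicted: Set[str], gold: Set[str]) -> Dict[str, float]:
    """Precision, recall and F1 of predicted node ids against labelled ones."""
    true_positives = len(predicted & gold)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical page: the nodes labelled as main content (assumed ids).
gold_main_content = {"n3", "n4", "n5", "n6"}
# Hypothetical output of a content extraction technique on the same page.
extractor_output = {"n4", "n5", "n6", "n9"}

scores = score_extraction(extractor_output, gold_main_content)
print(scores)
```

Under the same assumptions, a template detection technique could be scored in the same way by passing the nodes labelled as template as the `gold` set.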