Automatic extraction of forum posts and metadata is a crucial but challenging task because forums do not expose their content in a standardized structure. Content extraction methods therefore often need customization, such as adaptations to page templates and improvements to their extraction code, before they can be deployed to new forums. Most current solutions are also built for the more general case of content extraction from web pages and lack features that are key to understanding forum content, such as the identification of author metadata and information on the thread structure. This paper therefore presents a method that determines the XPath of forum posts, eliminating the incorrect mergers and splits of extracted posts that were common in previous-generation systems. From the individual posts, further metadata such as authors, forum URL, and structure are then extracted. We also introduce Harvest, a new open-source toolkit that implements the presented methods, and create a gold standard extracted from 52 different web forums for evaluating our approach. A comprehensive evaluation reveals that Harvest clearly outperforms competing systems.
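To make the idea of "determining the XPath of forum posts" concrete, the following is a minimal sketch of one simple heuristic, not the method implemented in Harvest. It assumes the page is well-formed XML (real forum HTML usually needs a tolerant parser first), that posts share a `class` attribute, and that the repeated element path carrying the most text is the post container. The function name `candidate_post_xpath`, the parameter `min_repeats`, and the sample page are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def candidate_post_xpath(markup: str, min_repeats: int = 2) -> str:
    """Return the repeated, class-qualified element path that holds the most text.

    Illustrative heuristic only; class values containing quotes are not handled.
    """
    root = ET.fromstring(markup)
    text_per_path = defaultdict(int)  # total characters of text under each path
    hits_per_path = defaultdict(int)  # number of elements matching each path

    def step(el):
        # Qualify the tag with its class attribute when one is present.
        cls = el.get("class")
        return f"{el.tag}[@class='{cls}']" if cls else el.tag

    def walk(el, prefix):
        path = f"{prefix}/{step(el)}"
        text_per_path[path] += len("".join(el.itertext()).strip())
        hits_per_path[path] += 1
        for child in el:
            walk(child, path)

    walk(root, "")
    repeated = [p for p, n in hits_per_path.items() if n >= min_repeats]
    if not repeated:
        raise ValueError("no repeated element path found")
    # Most text wins; ties go to the shorter (outermost) path.
    return max(repeated, key=lambda p: (text_per_path[p], -p.count("/")))


# Hypothetical two-post thread page used only for illustration.
PAGE = """<html><body>
<div class='nav'><a>Home</a><a>Forum</a></div>
<div class='post'><span class='author'>alice</span><p>First message in the thread.</p></div>
<div class='post'><span class='author'>bob</span><p>A reply to the first message.</p></div>
</body></html>"""

print(candidate_post_xpath(PAGE))  # prints /html/body/div[@class='post']
```

Selecting one path per post container, rather than one per text block, is what keeps a post's author line and body together (avoiding splits) while keeping neighbouring posts apart (avoiding mergers). A production system would need considerably more robust signals than raw text length.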