The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites. In this paper, we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a webpage and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier on these potentially noisy and incomplete labels to predict new relation instances. Our method can compete with annotation-based techniques in the literature in terms of extraction quality. A large-scale experiment on over 400,000 pages from dozens of multilingual long-tail websites harvested 1.25 million facts at a precision of 90%.
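The label-generation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a page is already reduced to a list of (XPath, text) pairs, that the page's topic entity is known, and that alignment is plain normalized string matching. The function names, the toy knowledge base, and the XPaths are all hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so KB values and page text compare cleanly."""
    return " ".join(text.lower().split())


def generate_labels(kb_triples, page_nodes, topic):
    """Return {xpath: set of predicates} for text nodes that match a KB object of `topic`.

    Assumption: exact matching on normalized strings stands in for the alignment step.
    """
    # Index the KB facts about the topic entity by their normalized object string.
    objects = defaultdict(set)
    for subj, pred, obj in kb_triples:
        if normalize(subj) == normalize(topic):
            objects[normalize(obj)].add(pred)

    labels = {}
    for xpath, text in page_nodes:
        preds = objects.get(normalize(text))
        if preds:
            # A node may receive several candidate predicates: the labels are noisy.
            labels[xpath] = set(preds)
    return labels


# Hypothetical toy knowledge base and page.
kb = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "written_by", "Christopher Nolan"),
    ("Inception", "release_year", "2010"),
]
page = [
    ("/html/body/h1", "Inception"),
    ("/html/body/div[1]/span[2]", "Christopher  Nolan"),
    ("/html/body/div[2]/span[2]", "2010"),
    ("/html/body/div[3]/span[2]", "148 min"),
]
labels = generate_labels(kb, page, topic="Inception")
```

In this toy case the node containing "Christopher Nolan" is labeled with both `directed_by` and `written_by`, while the runtime node gets no label because the knowledge base lacks that fact. These are the kinds of noisy and incomplete labels that a downstream classifier would have to tolerate.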