Webpage information extraction (WIE) is an important step in creating knowledge bases. Classical WIE methods leverage the Document Object Model (DOM) tree of a website. However, relying on the DOM tree poses significant challenges, because it encodes context and appearance only in an abstract manner. To address this challenge, we propose to reformulate WIE as a context-aware Webpage Object Detection task. Specifically, we develop a Context-aware Visual Attention-based (CoVA) detection pipeline, which combines appearance features with syntactical structure from the DOM tree. To study the approach, we collect a new large-scale dataset of e-commerce websites, for which we manually annotate every web element using four labels: product price, product title, product image, and background. On this dataset, we show that the proposed CoVA approach is a new challenging baseline that improves upon prior state-of-the-art methods.