This paper tackles the under-explored problem of DOM element nomination and representation learning with three important contributions. First, we present a large-scale and realistic dataset of webpages, far richer and more diverse than other datasets proposed for element representation learning, classification and nomination on the web. The dataset contains 51,701 manually labeled product pages from 8,175 real e-commerce websites. Second, we adapt several Graph Neural Network (GNN) architectures to website DOM trees and benchmark their performance on a diverse set of element nomination tasks using our proposed dataset. In element nomination, a single element on a page is selected for a given class. We show that on our challenging dataset a simple Convolutional GNN outperforms state-of-the-art methods on web element nomination. Finally, we propose a new training method that further boosts the element nomination accuracy. In nomination for the web, classification (assigning a class to a given element) is usually used as a surrogate objective for nomination during training. Our novel training methodology steers the classification objective towards the more complex and useful nomination objective.
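The distinction between classification and nomination drawn above can be made concrete with a small sketch. It is not the paper's model or training method; the class names, the score matrix, and the helper functions `classify` and `nominate` are all hypothetical and exist only to illustrate the two objectives.

```python
from typing import Dict, List

# Hypothetical class set and per-element scores for one page.
# scores[i][c] is the score of class c for DOM element i (made-up values).
CLASSES = ["price", "title", "other"]


def classify(scores: List[List[float]]) -> List[str]:
    """Classification: assign each element its highest-scoring class."""
    return [CLASSES[max(range(len(row)), key=row.__getitem__)] for row in scores]


def nominate(scores: List[List[float]], targets: List[str]) -> Dict[str, int]:
    """Nomination: for each target class, pick the single best element on the page."""
    result = {}
    for name in targets:
        c = CLASSES.index(name)
        result[name] = max(range(len(scores)), key=lambda i: scores[i][c])
    return result


scores = [
    [0.10, 0.20, 0.70],  # element 0
    [0.40, 0.15, 0.45],  # element 1
    [0.05, 0.60, 0.35],  # element 2
]
labels = classify(scores)                     # one class per element
picks = nominate(scores, ["price", "title"])  # one element index per class
```

In this toy page, per-element classification never labels any element as `price`, yet nomination still selects element 1 as the best `price` candidate. This is one way to see why classification is only a surrogate for the nomination objective.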