There is a persistent need to precisely extract structured knowledge from the web (i.e., from HTML documents). Given a web page, extracting a structured object together with its attributes of interest (e.g., price, publisher, author, and genre for a book) facilitates a variety of downstream applications such as large-scale knowledge base construction, e-commerce product search, and personalized recommendation. Because each web page is rendered from an HTML DOM tree, existing approaches formulate the problem as a DOM tree node tagging task. However, they either rely on computationally expensive visual feature engineering or are incapable of modeling the relationships among the tree nodes. In this paper, we propose a novel transferable method, Simplified DOM Trees for Attribute Extraction (SimpDOM), which tackles the problem by leveraging the tree structure to efficiently retrieve useful context for each node. To evaluate our approach, we study two challenging experimental settings: (i) intra-vertical few-shot extraction, and (ii) cross-vertical few-shot extraction with out-of-domain knowledge. Extensive experiments on the public SWDE dataset show that SimpDOM outperforms the state-of-the-art (SOTA) method by 1.44% on the F1 score. We also find that utilizing knowledge from a different vertical (cross-vertical extraction) is surprisingly useful and helps beat the SOTA by a further 1.37%.
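To make the node-tagging formulation described above concrete, the following is a minimal sketch of collecting tree-based context for each text node of a DOM tree. It is an illustration only and not the SimpDOM algorithm itself: the names `Node`, `TreeBuilder`, `parse`, `text_nodes`, `context_for`, and the `max_up` parameter are all hypothetical, and the rule used here (gather the text of every node under an ancestor a few levels up) is an assumed simplification of "retrieving useful context by leveraging the tree structure".

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a small illustrative subset).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A hypothetical minimal DOM node: tag name, parent, children, own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_nodes(node):
    """Return all nodes that carry text, in document order."""
    found = [node] if node.text else []
    for child in node.children:
        found.extend(text_nodes(child))
    return found


def context_for(node, max_up=2):
    """Text of the other nodes under the ancestor `max_up` levels above `node`.

    Assumed simplification: nearby nodes in the tree serve as context.
    """
    ancestor = node
    for _ in range(max_up):
        if ancestor.parent is None:
            break
        ancestor = ancestor.parent
    return [n.text for n in text_nodes(ancestor) if n is not node]


PAGE = """
<div>
  <h1>Dune</h1>
  <ul>
    <li><span>Author:</span><span>Frank Herbert</span></li>
    <li><span>Price:</span><span>$9.99</span></li>
  </ul>
</div>
"""

if __name__ == "__main__":
    for n in text_nodes(parse(PAGE)):
        print(repr(n.text), "context:", context_for(n, max_up=1))
```

In this toy page, the value node `$9.99` picks up its label `Price:` as context when `max_up=1`. A tagger could then use such context as an input feature for deciding which attribute, if any, a node expresses; how SimpDOM actually selects and encodes context is not specified in the abstract.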