Structure information extraction is the task of extracting structured text fields from web pages, for example extracting a product offer (product title, description, brand, and price) from a shopping page. It is an important research topic that has been widely studied in document understanding and web search. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. However, effectively serializing tokens from unstructured web pages is challenging in practice because web layout patterns vary widely. Little work has focused on modeling the web layout when extracting text fields. In this paper, we introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents. First, we design HTML tokens for each DOM node in the HTML by embedding representations from their neighboring tokens through graph attention. Second, we construct rich attention patterns between HTML tokens and text tokens, which leverage the web layout for effective attention weight computation. We conduct an extensive set of experiments on the SWDE and Common Crawl benchmarks. Experimental results demonstrate the superior performance of the proposed approach over several state-of-the-art methods.
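To make the idea of layout-aware attention between HTML tokens and text tokens concrete, the following is a minimal sketch and not the WebFormer implementation. It creates one HTML token per DOM element and one text token per whitespace-separated word, then builds a boolean mask saying which tokens may attend to which. The specific rules (parent, child, and sibling edges between HTML tokens; an element paired with its directly enclosed text; a local window between text tokens) are illustrative assumptions, as are the names `DomTokenizer`, `build_attention_mask`, and `window`. The abstract does not specify the actual patterns.

```python
from html.parser import HTMLParser


class DomTokenizer(HTMLParser):
    """Hypothetical helper: one HTML token per element, one text token per word.

    Assumes well-formed markup in which every start tag has an end tag
    (void elements such as a bare <br> are not handled in this sketch).
    """

    def __init__(self):
        super().__init__()
        self.tags = []     # tag name of each HTML token
        self.parent = []   # index of the parent HTML token, -1 for a root
        self.words = []    # text tokens
        self.owner = []    # index of the HTML token directly enclosing each word
        self._stack = []   # indices of currently open elements

    def handle_starttag(self, tag, attrs):
        self.parent.append(self._stack[-1] if self._stack else -1)
        self.tags.append(tag)
        self._stack.append(len(self.tags) - 1)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return  # ignore text outside any element
        for word in data.split():
            self.words.append(word)
            self.owner.append(self._stack[-1])


def build_attention_mask(html, window=2):
    """Return (tokenizer, mask); mask[i][j] is True if token i may attend to j.

    Token order: all HTML tokens first, then all text tokens.
    The attention rules below are assumptions chosen for illustration.
    """
    tok = DomTokenizer()
    tok.feed(html)
    tok.close()
    n_html, n_text = len(tok.tags), len(tok.words)
    n = n_html + n_text
    mask = [[i == j for j in range(n)] for i in range(n)]  # self-attention

    # HTML <-> HTML: edges of the DOM graph (parent/child and siblings).
    for i, par_i in enumerate(tok.parent):
        if par_i < 0:
            continue
        mask[i][par_i] = mask[par_i][i] = True
        for j, par_j in enumerate(tok.parent):
            if i != j and par_i == par_j:
                mask[i][j] = True

    # HTML <-> text: an element and the words it directly encloses.
    for t, h in enumerate(tok.owner):
        mask[h][n_html + t] = mask[n_html + t][h] = True

    # text <-> text: local window over the serialized word sequence.
    for a in range(n_text):
        for b in range(max(0, a - window), min(n_text, a + window + 1)):
            mask[n_html + a][n_html + b] = True

    return tok, mask
```

In a transformer, a mask like this would typically be applied by setting the attention logits of disallowed pairs to a large negative value before the softmax, so that the DOM structure constrains which tokens exchange information.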