Recently, neural models have significantly improved the performance of information extraction from semi-structured websites. However, a barrier to continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages that list multiple items. PLAtE encompasses two tasks: (1) finding product-list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 53,905 items from 6,810 pages, making it the first large-scale list-page web extraction dataset. We construct PLAtE by collecting list pages from Common Crawl and then annotating them on Mechanical Turk. We perform quantitative and qualitative analyses to demonstrate that PLAtE has high-quality annotations. We establish strong baseline performance on PLAtE with a state-of-the-art (SOTA) model that achieves an F1-score of 0.750 for attribute classification and 0.915 for segmentation, indicating opportunities for future research innovations in web extraction.
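As a rough illustration of how the two tasks and the F1-scores above could be scored, the following minimal sketch computes set-based precision, recall, and F1 on a toy page. The abstract does not specify the dataset schema or the exact evaluation protocol, so the node indices, the attribute names (`product_name`, `review_text`, `purchase_link`), and the helper `precision_recall_f1` are all hypothetical assumptions made for this example.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(
    gold: Iterable[Hashable], predicted: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 over hashable items."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Task 1 (segmentation), toy data: indices of the page nodes at which
# a new product begins. These indices are hypothetical.
gold_boundaries = [0, 4, 9, 15]
predicted_boundaries = [0, 4, 10, 15]

# Task 2 (attribute extraction), toy data: (node index, attribute) pairs.
# The attribute names are assumptions, not taken from the abstract.
gold_attributes = [(1, "product_name"), (2, "review_text"), (3, "purchase_link")]
predicted_attributes = [(1, "product_name"), (2, "review_text"), (5, "purchase_link")]

seg_p, seg_r, seg_f1 = precision_recall_f1(gold_boundaries, predicted_boundaries)
attr_p, attr_r, attr_f1 = precision_recall_f1(gold_attributes, predicted_attributes)

print(f"segmentation F1: {seg_f1:.3f}")
print(f"attribute F1:    {attr_f1:.3f}")
```

Exact-match scoring over sets is only one possible choice; a real evaluation might instead score per node, per attribute class, or with partial credit for near-miss boundaries.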