Extracting structured information from HTML documents is a long-studied problem with a broad range of applications, including knowledge base construction, faceted search, and personalized recommendation. Prior work relies on either a few human-labeled web pages from each target website or thousands of human-labeled web pages from a set of seed websites to train a transferable extraction model that generalizes to unseen target websites. Noisy content, low site-level consistency, and a lack of inter-annotator agreement make labeling web pages time-consuming and expensive. To overcome these limitations, we develop LEAST, a Label-Efficient Self-Training method for Semi-Structured Web Documents. LEAST uses a few human-labeled pages to pseudo-annotate a large number of unlabeled web pages from the target vertical, then trains a transferable web-extraction model on both human-labeled and pseudo-labeled samples using self-training. To mitigate error propagation from noisy training samples, LEAST re-weights each training sample by its estimated label accuracy and incorporates that weight in training. To the best of our knowledge, this is the first work to propose end-to-end training for transferable web-extraction models using only a few human-labeled pages. Experiments on a large-scale public dataset show that, with fewer than ten human-labeled pages from each seed website for training, a LEAST-trained model outperforms the previous state of the art by more than 26 average F1 points on unseen websites and reduces the number of human-labeled pages needed to achieve similar performance by more than 10x.
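The re-weighting idea can be illustrated with a minimal sketch. This is not the LEAST implementation: the function names `pseudo_label` and `weighted_cross_entropy` are hypothetical, and using the model's own confidence as the estimate of label accuracy is an assumption made purely for illustration. The abstract does not specify how that estimate is computed.

```python
import math


def pseudo_label(probs):
    """Return (argmax label, its probability) for one unlabeled sample.

    Assumption: the top-class probability stands in for the
    "estimated label accuracy" of the pseudo label.
    """
    label = max(range(len(probs)), key=lambda k: probs[k])
    return label, probs[label]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean cross-entropy over a batch.

    probs:   list of per-class probability lists, one per sample
    labels:  list of integer class ids (human or pseudo labels)
    weights: list of non-negative floats; higher means more trusted
    """
    total = 0.0
    weight_sum = 0.0
    for p, y, w in zip(probs, labels, weights):
        # Clamp to avoid log(0) when the model assigns zero probability.
        total += -w * math.log(max(p[y], 1e-12))
        weight_sum += w
    # Normalize by total weight so down-weighted samples do not shrink the loss scale.
    return total / weight_sum if weight_sum > 0 else 0.0
```

In this sketch a human-labeled sample would receive weight 1.0, while a pseudo-labeled sample would receive the confidence returned by `pseudo_label`, so low-confidence pseudo labels contribute less to the training loss.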