Large pre-trained neural networks are ubiquitous and critical to the success of many downstream tasks in natural language processing and computer vision. Within web information retrieval, by contrast, there is a stark lack of similarly flexible and powerful pre-trained models that can properly parse webpages. Consequently, we believe that common machine learning tasks such as content extraction and information mining from webpages hold low-hanging gains that remain untapped. We aim to close this gap by introducing a task-agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune effectively to arbitrary tasks on webpages. Finally, we show that our pre-trained model achieves state-of-the-art results on multiple datasets across two very different benchmarks, webpage boilerplate removal and genre classification, lending support to its potential application in diverse downstream tasks.
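To make the "ingest webpage structures" idea concrete, the sketch below shows one plausible way a webpage's DOM tree could be converted into the node and edge lists a graph neural network consumes. This is an illustrative assumption, not the paper's actual featurization: the `DOMGraphBuilder` class and the choice of tag names as node features are hypothetical stand-ins.

```python
# Hypothetical sketch: turning a webpage's DOM tree into a graph
# (node list + edge list) that a graph neural network could ingest.
# Tag names stand in for whatever node features the model would use.
from html.parser import HTMLParser

class DOMGraphBuilder(HTMLParser):
    """Collects one node per HTML tag and one edge per parent-child link."""
    def __init__(self):
        super().__init__()
        self.nodes = []   # node features: tag names, in document order
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of currently open tags

    def handle_starttag(self, tag, attrs):
        idx = len(self.nodes)
        self.nodes.append(tag)
        if self._stack:
            self.edges.append((self._stack[-1], idx))
        self._stack.append(idx)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

def dom_to_graph(html):
    builder = DOMGraphBuilder()
    builder.feed(html)
    return builder.nodes, builder.edges

nodes, edges = dom_to_graph("<html><body><div><p></p></div></body></html>")
print(nodes)   # ['html', 'body', 'div', 'p']
print(edges)   # [(0, 1), (1, 2), (2, 3)]
```

Such a graph representation preserves the hierarchical structure of the page, which is exactly the signal lost when a webpage is flattened into plain text for a standard language model.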