Recent years have witnessed great progress in applying pre-trained language models, e.g., BERT, to information retrieval (IR) tasks. Hyperlinks, which are commonly used in Web pages, have been leveraged for designing pre-training objectives. For example, anchor texts of hyperlinks have been used to simulate queries, thus constructing a tremendous number of query-document pairs for pre-training. However, as a bridge between two web pages, the potential of hyperlinks has not been fully explored. In this work, we focus on modeling the relationship between two documents connected by hyperlinks and on designing a new pre-training objective for ad-hoc retrieval. Specifically, we categorize the relationships between documents into four groups: no link, unidirectional link, symmetric link, and the most relevant symmetric link. By comparing two documents sampled from adjacent groups, the model can gradually improve its capability of capturing matching signals. We propose a progressive hyperlink prediction (PHP) framework to explore the utilization of hyperlinks in pre-training. Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.
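The four-group categorization can be illustrated with a minimal sketch. The link-graph representation and the "most relevant" criterion used here (the reciprocal link appears in the leading section of both pages) are illustrative assumptions, not the paper's exact implementation:

```python
def categorize(links, prominent, a, b):
    """Classify the hyperlink relationship between documents a and b.

    links[d]     -- set of documents that page d links to
    prominent[d] -- subset of links[d] appearing in d's leading section
                    (an assumed proxy for link relevance)
    """
    a_to_b = b in links.get(a, set())
    b_to_a = a in links.get(b, set())
    if not a_to_b and not b_to_a:
        return "no link"
    if a_to_b != b_to_a:
        return "unidirectional link"
    # Links exist in both directions; check the assumed relevance signal.
    if b in prominent.get(a, set()) and a in prominent.get(b, set()):
        return "most relevant symmetric link"
    return "symmetric link"
```

During pre-training, pairs would then be sampled from adjacent groups (e.g., "no link" vs. "unidirectional link") so that the model learns progressively finer-grained matching signals.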