Recently, pre-trained language representation models such as BERT have shown great success when fine-tuned on downstream tasks, including information retrieval (IR). However, pre-training objectives tailored for ad-hoc retrieval have not been well explored. In this paper, we propose Pre-training with Representative wOrds Prediction (PROP) for ad-hoc retrieval. PROP is inspired by the classical statistical language model for IR, specifically the query likelihood model, which assumes that the query is generated as the piece of text representative of the "ideal" document. Based on this idea, we construct the Representative wOrds Prediction (ROP) task for pre-training. Given an input document, we sample a pair of word sets according to the document language model, where the set with higher likelihood is deemed more representative of the document. We then pre-train the Transformer model to predict the pairwise preference between the two word sets, jointly with the Masked Language Model (MLM) objective. By further fine-tuning on a variety of representative downstream ad-hoc retrieval tasks, PROP achieves significant improvements over baselines without pre-training or with other pre-training methods. We also show that PROP achieves strong performance under both zero- and low-resource IR settings. The code and pre-trained models are available at https://github.com/Albert-Ma/PROP.
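The ROP data construction described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a maximum-likelihood unigram document language model (the paper may use smoothing), samples word multisets with replacement for simplicity, and uses hypothetical helper names (`make_rop_pair`, etc.). The key idea it shows is that the set with the higher likelihood under the document LM becomes the positive example of the pairwise preference.

```python
import math
import random
from collections import Counter

def doc_language_model(doc_tokens):
    """Maximum-likelihood unigram LM estimated from the document's own tokens."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

def sample_word_set(lm, size, rng):
    """Sample `size` words (with replacement, for simplicity) from the document LM."""
    words = list(lm)
    weights = [lm[w] for w in words]
    return rng.choices(words, weights=weights, k=size)

def log_likelihood(word_set, lm):
    """Log-likelihood of a word set under the document LM."""
    return sum(math.log(lm[w]) for w in word_set)

def make_rop_pair(doc_tokens, set_size=5, seed=0):
    """Build one ROP training instance: (positive set, negative set),
    where the positive set has the higher likelihood under the doc LM.
    The Transformer is then pre-trained to score the positive set above
    the negative one (jointly with MLM)."""
    rng = random.Random(seed)
    lm = doc_language_model(doc_tokens)
    s1 = sample_word_set(lm, set_size, rng)
    s2 = sample_word_set(lm, set_size, rng)
    if log_likelihood(s1, lm) >= log_likelihood(s2, lm):
        return s1, s2
    return s2, s1
```

Each (positive, negative) pair then serves as a pairwise-preference training example for the pre-training objective; the actual model and loss are as described in the paper, not shown here.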