Finetuning pretrained language models (PLMs) for IR has been the de facto standard practice since their breakthrough effectiveness a few years ago. But is this approach well understood? In this paper, we study the impact of the pretraining collection on the final IR effectiveness. In particular, we challenge the current hypothesis that PLMs must be pretrained on a sufficiently large generic collection, and we show that pretraining from scratch on the collection of interest is surprisingly competitive with the current approach. We benchmark first-stage rankers and cross-encoder rerankers on general passage retrieval with MSMARCO, on Mr-Tydi for Arabic, Japanese and Russian, and on TripClick for a specialized domain. Contrary to popular belief, we show that, for finetuning first-stage rankers, models pretrained solely on their target collection have equivalent or better effectiveness than more general models. However, there is a slight effectiveness drop for rerankers pretrained only on the target collection. Overall, our study sheds new light on the role of the pretraining collection and should lead our community to reconsider building specialized models by pretraining from scratch. Last but not least, doing so could enable better control of efficiency, data bias and replicability, which are key research questions for the IR community.
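To make the "pretraining from scratch on the collection of interest" setup concrete, the following is a minimal sketch (not the authors' code) of masked-language-model pretraining of a randomly initialized BERT-style encoder on the target collection, before any IR finetuning. The file path, sequence length, batch size and number of epochs are illustrative placeholders, and an off-the-shelf vocabulary is reused only to keep the sketch short.

```python
# Sketch: MLM pretraining from scratch on the target collection (assumed setup).
from datasets import load_dataset
from transformers import (
    BertConfig, BertForMaskedLM, BertTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Hypothetical file: one passage of the target collection per line.
dataset = load_dataset("text", data_files={"train": "msmarco_passages.txt"})

# A tokenizer trained on the same collection would normally be used;
# an existing vocabulary is reused here purely for brevity.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly initialized encoder: no generic-corpus pretraining is reused.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="plm-from-scratch",
        per_device_train_batch_size=32,
        num_train_epochs=3,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()  # the resulting checkpoint would then be finetuned as a first-stage ranker or reranker
```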