An Experimental Study on Pretraining Transformers from Scratch for IR

Finetuning Pretrained Language Models (PLMs) for IR has been the de facto standard practice since their breakthrough in effectiveness a few years ago. But is this approach well understood? In this paper, we study the impact of the pretraining collection on final IR effectiveness. In particular, we challenge the current hypothesis that a PLM should be trained on a large enough generic collection, and we show that pretraining from scratch on the collection of interest is surprisingly competitive with the current approach. We benchmark first-stage rankers and cross-encoders for reranking on general passage retrieval with MSMARCO, on Mr-Tydi for Arabic, Japanese and Russian, and on TripClick for a specific domain. Contrary to popular belief, we show that, when finetuning first-stage rankers, models pretrained solely on their collection have equivalent or better effectiveness compared to more general models. However, there is a slight effectiveness drop for rerankers pretrained only on the target collection. Overall, our study sheds new light on the role of the pretraining collection and should encourage our community to consider building specialized models by pretraining from scratch. Last but not least, doing so could enable better control of efficiency, data bias and replicability, which are key research questions for the IR community.



