In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, inspired by ELECTRA, to improve sample efficiency and reduce the mismatch of input distributions between pre-training and fine-tuning. SimLM only requires access to an unlabeled corpus, and is thus more broadly applicable in settings where no labeled data or queries are available. We conduct experiments on several large-scale passage retrieval datasets and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2, which incur significantly higher storage costs.
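To make the bottleneck idea concrete, below is a minimal PyTorch sketch of this style of pre-training: a deep encoder produces a single bottleneck vector, a shallow decoder must reconstruct the original tokens from replaced inputs using mainly that vector. All module names, layer counts, and dimensions here are illustrative assumptions, not the authors' exact architecture, and the token-replacement generator is omitted.

```python
# Hypothetical sketch of bottleneck pre-training with a replaced LM objective.
# Sizes, names, and the single-sided decoder loss are simplifying assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckPretrainer(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, enc_layers=6, dec_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)  # deep encoder
        self.decoder = nn.TransformerEncoder(dec_layer, dec_layers)  # shallow decoder
        self.lm_head = nn.Linear(dim, vocab_size)

    def embed(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.tok_emb(ids) + self.pos_emb(pos)

    def forward(self, enc_input_ids, dec_input_ids, original_ids):
        # Encode the passage with replaced tokens; position 0 serves as the
        # bottleneck vector (the dense passage representation).
        enc_out = self.encoder(self.embed(enc_input_ids))
        bottleneck = enc_out[:, :1, :]  # [batch, 1, dim]

        # The shallow decoder only sees the bottleneck plus its own (replaced)
        # input, so most passage information must flow through the bottleneck.
        dec_in = torch.cat([bottleneck, self.embed(dec_input_ids)[:, 1:, :]], dim=1)
        dec_out = self.decoder(dec_in)

        # Replaced language modeling objective: recover the original token
        # at every position (decoder side only, for brevity).
        logits = self.lm_head(dec_out)
        return F.cross_entropy(logits.view(-1, logits.size(-1)), original_ids.view(-1))

# Toy usage: in practice enc_input_ids / dec_input_ids would contain tokens
# replaced by a generator, while original_ids holds the uncorrupted passage.
model = BottleneckPretrainer()
ids = torch.randint(0, 30522, (2, 128))
loss = model(ids, ids, ids)
loss.backward()
```

After such pre-training, only the encoder (and its bottleneck vector) would be kept for fine-tuning as a dense passage retriever; the shallow decoder exists solely to create the information bottleneck during pre-training.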