Pre-trained language models (PTMs) have been shown to yield powerful text representations for the dense passage retrieval task. Masked Language Modeling (MLM) is a major sub-task of the pre-training process. However, we found that the conventional random masking strategy tends to select a large number of tokens that have limited effect on the passage retrieval task (e.g., stop-words and punctuation). Noticing that term importance weights provide valuable information for passage retrieval, we propose an alternative retrieval-oriented masking (dubbed ROM) strategy in which more important tokens have a higher probability of being masked out, so that this straightforward yet essential signal is captured during language model pre-training. Notably, the proposed token masking method does not change the architecture or learning objective of the original PTM. Our experiments verify that ROM enables term importance information to benefit language model pre-training, thus achieving better performance on multiple passage retrieval benchmarks.
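To make the masking strategy concrete, below is a minimal sketch of importance-weighted token masking, assuming per-token importance weights are precomputed (e.g., from term-statistics such as BM25-style weights); the function name, the 15% masking budget, and the example token ids are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def rom_mask(token_ids, importance_weights, mask_token_id, mask_ratio=0.15, rng=None):
    """Mask tokens with probability proportional to their importance weights."""
    rng = rng or np.random.default_rng()
    weights = np.asarray(importance_weights, dtype=float)
    probs = weights / weights.sum()                    # normalize to a sampling distribution
    n_mask = max(1, int(round(mask_ratio * len(token_ids))))
    masked_positions = rng.choice(len(token_ids), size=n_mask, replace=False, p=probs)

    masked_ids = list(token_ids)
    labels = [-100] * len(token_ids)                   # -100 = ignored by the MLM loss
    for pos in masked_positions:
        labels[pos] = masked_ids[pos]                  # predict the original token
        masked_ids[pos] = mask_token_id                # replace with [MASK]
    return masked_ids, labels

# Example: high-weight content tokens are far more likely to be masked
# than low-weight stop-word or punctuation tokens.
tokens  = [101, 1996, 2155, 1997, 7093, 102]           # hypothetical token ids
weights = [0.01, 0.01, 0.90, 0.01, 0.80, 0.01]         # hypothetical importance weights
print(rom_mask(tokens, weights, mask_token_id=103))
```

Because only the sampling distribution over mask positions changes, this plugs into a standard MLM pipeline without touching the model architecture or loss, consistent with the claim above.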