Sampling proper negatives from a large document pool is vital for effectively training a dense retrieval model. However, existing negative sampling strategies tend to yield either uninformative negatives or false negatives. In this work, we empirically show that, according to the measured relevance scores, the negatives ranked around the positives are generally more informative and less likely to be false negatives. Intuitively, these negatives are neither too hard (\emph{may be false negatives}) nor too easy (\emph{uninformative}). They are the ambiguous negatives and need more attention during training. Thus, we propose a simple ambiguous negatives sampling method, SimANS, which incorporates a new sampling probability distribution to sample more ambiguous negatives. Extensive experiments on four public datasets and one industry dataset show the effectiveness of our approach. Our code and models are publicly available at \url{https://github.com/microsoft/SimXNS}.
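The idea of favoring negatives whose relevance score is close to that of the positive can be illustrated with a minimal sketch. The abstract does not state the sampling distribution itself, so the Gaussian-shaped weighting below, the hyperparameters `a` (sharpness) and `b` (offset), and both function names are assumptions made for illustration, not the authors' released implementation.

```python
import math
import random


def ambiguous_negative_probs(neg_scores, pos_score, a=1.0, b=0.0):
    """Return sampling probabilities for candidate negatives.

    Assumed form (hypothetical, not given in the abstract):
        p_i proportional to exp(-a * (s_i - s_pos - b) ** 2)
    The weight peaks for negatives whose score is near the positive's
    score and decays for much higher (likely false negative) or much
    lower (likely uninformative) scores.
    """
    if not neg_scores:
        return []
    weights = [math.exp(-a * (s - pos_score - b) ** 2) for s in neg_scores]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed to zero; fall back to uniform sampling.
        return [1.0 / len(neg_scores)] * len(neg_scores)
    return [w / total for w in weights]


def sample_ambiguous_negatives(neg_ids, neg_scores, pos_score, k,
                               a=1.0, b=0.0, seed=None):
    """Draw k negatives (with replacement) under the assumed distribution."""
    rng = random.Random(seed)
    probs = ambiguous_negative_probs(neg_scores, pos_score, a, b)
    return rng.choices(neg_ids, weights=probs, k=k)


if __name__ == "__main__":
    ids = ["d1", "d2", "d3"]
    scores = [0.9, 0.5, 0.1]  # relevance scores from some retriever
    print(ambiguous_negative_probs(scores, pos_score=0.5, a=10.0))
    print(sample_ambiguous_negatives(ids, scores, pos_score=0.5, k=4, seed=0))
```

In this sketch a larger `a` concentrates the probability mass more tightly around the positive's score, while `b` shifts the peak above or below it. Sampling is done with replacement for brevity; a training pipeline would more likely draw distinct negatives.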