Neural retrieval models are often trained on (subsets of) the millions of queries in the MS MARCO / ORCAS datasets and then tested on the 250 Robust04 queries or on other TREC benchmarks that often have only 50 queries. In such setups, many of the few test queries can be very similar to queries in the huge training data -- in fact, 69% of the Robust04 queries have near-duplicates in MS MARCO / ORCAS. We investigate the impact of this unintended train-test leakage by training neural retrieval models on combinations of a fixed number of MS MARCO / ORCAS queries that are highly similar to the actual test queries and an increasing number of other queries. We find that leakage can improve effectiveness and even change the ranking of systems. However, these effects diminish as the proportion of leaked queries among all training instances decreases and the training setup thus becomes more realistic.
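To make the notion of leakage concrete, the following is a minimal sketch of how near-duplicate queries between a training log and a test topic set could be flagged. The normalization, the character-level similarity measure, and the 0.9 threshold are illustrative assumptions, not the paper's exact deduplication procedure.

```python
# Hypothetical near-duplicate detection between training and test queries.
# Normalization and the 0.9 threshold are illustrative assumptions.
import re
from difflib import SequenceMatcher

def normalize(query: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", query.lower()).split())

def find_leaked_test_queries(train_queries, test_queries, threshold=0.9):
    """Return the test queries that have a near-duplicate in the training data."""
    normalized_train = [normalize(q) for q in train_queries]
    leaked = []
    for test_query in test_queries:
        t = normalize(test_query)
        if any(SequenceMatcher(None, t, tr).ratio() >= threshold
               for tr in normalized_train):
            leaked.append(test_query)
    return leaked

# Toy example: spelling variants still count as near-duplicates.
train = ["international organized crime", "hydroponics gardening at home"]
test = ["International organised crime?", "piracy on the high seas"]
print(find_leaked_test_queries(train, test))
# ['International organised crime?']
```

At MS MARCO / ORCAS scale, pairwise character-level comparison would be far too slow; in practice, an inverted index over normalized queries or embedding-based nearest-neighbor search would be the more realistic way to find such near-duplicates.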